summaryrefslogtreecommitdiff
path: root/Documentation/design-docs
diff options
context:
space:
mode:
authorAndy Ayers <andya@microsoft.com>2016-11-30 18:16:17 -0800
committerAndy Ayers <andya@microsoft.com>2017-01-10 17:13:57 -0800
commitf2a2d9e1bd210b463fa929bab786e9b3529853f8 (patch)
treef58e585a2c0a5d6a97740fd4efd7877d7da12bce /Documentation/design-docs
parentec54d18ab8102305f1aad7a1a77da67637ee13f0 (diff)
downloadcoreclr-f2a2d9e1bd210b463fa929bab786e9b3529853f8.tar.gz
coreclr-f2a2d9e1bd210b463fa929bab786e9b3529853f8.tar.bz2
coreclr-f2a2d9e1bd210b463fa929bab786e9b3529853f8.zip
JIT: Finally Optimizations
Adds two optimization for try-finallys: empty finally removal and finally cloning. Empty finally removal identifies trivially empty finally clauses and removes the entire try-finally EH region. COde in the try is "promoted" to be in the parent EH region (or method region). Empty finallys often appear after inlining empty Dispose methods. Removing a try-finally with an empty finally both reduces code size and improves code speed. Finally cloning duplicates the code for the finally and 'inlines' it along one of the normal exit paths from the try. This improves code speed in the typical case where there is no exception raised while the try is active. It generally increases code size slightly. However, finallys are rare enough that the overall code size increase across all methods is quite small. The jit will clone most finallys, provided they are not too large, and are not contained in or contain other EH constructs. If a try contains multiple exit paths only the final "fall through" path will be optimized. These optimizations are enabled for all target architectures. Finally cloning is currently disabled for desktop CLR because more work is needed to support thread abort. More details on both optimizations can be found in the design document added as part of this commit. In debug builds, finally cloning can be selectively disabled or enabled by setting COMPlus_JitEnableFinallyCloning to 0 or 1 respectively. This config setting can thus be used override the default behavior (cloning enabled for CoreCLR, disabled otherwise) for diagnostic or testing purposes. Closes #1505. Closes #8065.
Diffstat (limited to 'Documentation/design-docs')
-rw-r--r--Documentation/design-docs/finally-optimizations.md451
1 files changed, 451 insertions, 0 deletions
diff --git a/Documentation/design-docs/finally-optimizations.md b/Documentation/design-docs/finally-optimizations.md
new file mode 100644
index 0000000000..ddc6dbd93c
--- /dev/null
+++ b/Documentation/design-docs/finally-optimizations.md
@@ -0,0 +1,451 @@
+Finally Optimizations
+=====================
+
+In MSIL, a try-finally is a construct where a block of code
+(the finally) is guaranteed to be executed after control leaves a
+protected region of code (the try) either normally or via an
+exception.
+
+In RyuJit a try-finally is currently implemented by transforming the
+finally into a local function that is invoked via jitted code at normal
+exits from the try block and is invoked via the runtime for exceptional
+exits from the try block.
+
+For x86 the local function is simply a part of the method and shares
+the same stack frame with the method. For other architectures the
+local function is promoted to a potentially separable "funclet"
+which is almost like a regular function with a prolog and epilog. A
+custom calling convention gives the funclet access to the parent stack
+frame.
+
+In this proposal we outline two optimizations for finallys: removing
+empty finallys and finally cloning.
+
+Empty Finally Removal
+---------------------
+
+An empty finally is one that has no observable effect. These often
+arise from `foreach` or `using` constructs (which induce a
+try-finally) where the cleanup method called in the finally does
+nothing. Often, after inlining, the empty finally is readily apparent.
+
+For example, this snippet of C# code
+```C#
+static int Sum(List<int> x) {
+ int sum = 0;
+ foreach(int i in x) {
+ sum += i;
+ }
+ return sum;
+}
+```
+produces the following jitted code:
+```asm
+; Successfully inlined Enumerator[Int32][System.Int32]:Dispose():this
+; (1 IL bytes) (depth 1) [below ALWAYS_INLINE size]
+G_M60484_IG01:
+ 55 push rbp
+ 57 push rdi
+ 56 push rsi
+ 4883EC50 sub rsp, 80
+ 488D6C2460 lea rbp, [rsp+60H]
+ 488BF1 mov rsi, rcx
+ 488D7DD0 lea rdi, [rbp-30H]
+ B906000000 mov ecx, 6
+ 33C0 xor rax, rax
+ F3AB rep stosd
+ 488BCE mov rcx, rsi
+ 488965C0 mov qword ptr [rbp-40H], rsp
+
+G_M60484_IG02:
+ 33C0 xor eax, eax
+ 8945EC mov dword ptr [rbp-14H], eax
+ 8B01 mov eax, dword ptr [rcx]
+ 8B411C mov eax, dword ptr [rcx+28]
+ 33D2 xor edx, edx
+ 48894DD0 mov gword ptr [rbp-30H], rcx
+ 8955D8 mov dword ptr [rbp-28H], edx
+ 8945DC mov dword ptr [rbp-24H], eax
+ 8955E0 mov dword ptr [rbp-20H], edx
+
+G_M60484_IG03:
+ 488D4DD0 lea rcx, bword ptr [rbp-30H]
+ E89B35665B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
+ 85C0 test eax, eax
+ 7418 je SHORT G_M60484_IG05
+
+; Body of foreach loop
+
+G_M60484_IG04:
+ 8B4DE0 mov ecx, dword ptr [rbp-20H]
+ 8B45EC mov eax, dword ptr [rbp-14H]
+ 03C1 add eax, ecx
+ 8945EC mov dword ptr [rbp-14H], eax
+ 488D4DD0 lea rcx, bword ptr [rbp-30H]
+ E88335665B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
+ 85C0 test eax, eax
+ 75E8 jne SHORT G_M60484_IG04
+
+; Normal exit from the implicit try region created by `foreach`
+; Calls the finally to dispose of the iterator
+
+G_M60484_IG05:
+ 488BCC mov rcx, rsp
+ E80C000000 call G_M60484_IG09 // call to finally
+
+G_M60484_IG06:
+ 90 nop
+
+G_M60484_IG07:
+ 8B45EC mov eax, dword ptr [rbp-14H]
+
+G_M60484_IG08:
+ 488D65F0 lea rsp, [rbp-10H]
+ 5E pop rsi
+ 5F pop rdi
+ 5D pop rbp
+ C3 ret
+
+; Finally funclet. Note it simply sets up and then tears down a stack
+; frame. The dispose method was inlined and is empty.
+
+G_M60484_IG09:
+ 55 push rbp
+ 57 push rdi
+ 56 push rsi
+ 4883EC30 sub rsp, 48
+ 488B6920 mov rbp, qword ptr [rcx+32]
+ 48896C2420 mov qword ptr [rsp+20H], rbp
+ 488D6D60 lea rbp, [rbp+60H]
+
+G_M60484_IG10:
+ 4883C430 add rsp, 48
+ 5E pop rsi
+ 5F pop rdi
+ 5D pop rbp
+ C3 ret
+```
+
+In such cases the try-finally can be removed, leading to code like the following:
+```asm
+G_M60484_IG01:
+ 57 push rdi
+ 56 push rsi
+ 4883EC38 sub rsp, 56
+ 488BF1 mov rsi, rcx
+ 488D7C2420 lea rdi, [rsp+20H]
+ B906000000 mov ecx, 6
+ 33C0 xor rax, rax
+ F3AB rep stosd
+ 488BCE mov rcx, rsi
+
+G_M60484_IG02:
+ 33F6 xor esi, esi
+ 8B01 mov eax, dword ptr [rcx]
+ 8B411C mov eax, dword ptr [rcx+28]
+ 48894C2420 mov gword ptr [rsp+20H], rcx
+ 89742428 mov dword ptr [rsp+28H], esi
+ 8944242C mov dword ptr [rsp+2CH], eax
+ 89742430 mov dword ptr [rsp+30H], esi
+
+G_M60484_IG03:
+ 488D4C2420 lea rcx, bword ptr [rsp+20H]
+ E8A435685B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
+ 85C0 test eax, eax
+ 7414 je SHORT G_M60484_IG05
+
+G_M60484_IG04:
+ 8B4C2430 mov ecx, dword ptr [rsp+30H]
+ 03F1 add esi, ecx
+ 488D4C2420 lea rcx, bword ptr [rsp+20H]
+ E89035685B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
+ 85C0 test eax, eax
+ 75EC jne SHORT G_M60484_IG04
+
+G_M60484_IG05:
+ 8BC6 mov eax, esi
+
+G_M60484_IG06:
+ 4883C438 add rsp, 56
+ 5E pop rsi
+ 5F pop rdi
+ C3 ret
+```
+
+Empty finally removal is unconditionally profitable: it should always
+reduce code size and improve code speed.
+
+Finally Cloning
+---------------
+
+Finally cloning is an optimization where the jit duplicates the code
+in the finally for one or more of the normal exit paths from the try,
+and has those exit points branch to the duplicated code directly,
+rather than calling the finally. This transformation allows for
+improved performance and optimization of the common case where the try
+completes without an exception.
+
+Finally cloning also allows hot/cold splitting of finally bodies: the
+cloned finally code covers the normal try exit paths (the hot cases)
+and can be placed in the main method region, and the original finally,
+now used largely or exclusively for exceptional cases (the cold cases)
+spilt off into the cold code region. Without cloning, RyuJit
+would always treat the finally as cold code.
+
+Finally cloning will increase code size, though often the size
+increase is mitigated somewhat by more compact code generation in the
+try body and streamlined invocation of the cloned finallys.
+
+Try-finally regions may have multiple normal exit points. For example
+the following `try` has two: one at the `return 3` and one at the try
+region end:
+
+```C#
+try {
+ if (p) return 3;
+ ...
+}
+finally {
+ ...
+}
+return 4;
+```
+
+Here the finally must be executed no matter how the try exits. So
+there are to two normal exit paths from the try, both of which pass
+through the finally but which then diverge. The fact that some try
+regions can have multiple exits opens the potential for substantial
+code growth from finally cloning, and so leads to a choice point in
+the implementation:
+
+* Share the clone along all exit paths
+* Share the clone along some exit paths
+* Clone along all exit paths
+* Clone along some exit paths
+* Only clone along one exit path
+* Only clone when there is one exit path
+
+The shared clone option must essentially recreate or simulate the
+local call mechanism for the finally, though likely somewhat more
+efficiently. Each exit point must designate where control should
+resume once the shared finally has finished. For instance the jit
+could introduce a new local per try-finally to determine where the
+cloned finally should resume, and enumerate the possibilities using a
+small integer. The end of the cloned finally would then use a switch
+to determine what code to execute next. This has the downside of
+introducing unrealizable paths into the control flow graph.
+
+Cloning along all exit paths can potentially lead to large amounts of
+code growth.
+
+Cloning along some paths or only one path implies that some normal
+exit paths won't be as well optimized. Nonetheless cloning along one
+path was the choice made by JIT64 and the one we recommend for
+implementation. In particular we suggest only cloning along the end of
+try region exit path, so that any early exit will continue to invoke
+the funclet for finally cleanup (unless that exit happens to have the
+same post-finally continuation as the end try region exit, in which
+case it can simply jump to the cloned finally).
+
+One can imagine adaptive strategies. The size of the finally can
+be roughly estimated and the number of clones needed for full cloning
+readily computed. Selective cloning can be based on profile
+feedback or other similar mechanisms for choosing the profitable
+cases.
+
+The current implementation will clone the finally and retarget the
+last (largest IL offset) leave in the try region to the clone. Any
+other leave that ultimately transfers control to the same post-finally
+offset will also be modified to jump to the clone.
+
+Empirical studies have shown that most finallys are small. Thus to
+avoid excessive code growth, a crude size estimate is formed by
+counting the number of statements in the blocks that make up the
+finally. Any finally larger that 15 statements is not cloned. In our
+study this disqualifed about 0.5% of all finallys from cloning.
+
+### EH Nesting Considerations
+
+Finally cloning is also more complicated when the finally encloses
+other EH regions, since the clone will introduce copies of all these
+regions. While it is possible to implement cloning in such cases we
+propose to defer for now.
+
+Finally cloning is also a bit more complicated if the finally is
+enclosed by another finally region, so we likewise propose deferring
+support for this. (Seems like a rare enough thing but maybe not too
+hard to handle -- though possibly not worth it if we're not going to
+support the enclosing case).
+
+### Control-Flow and Other Considerations
+
+If the try never exits normally, then the finally can only be invoked
+in exceptional cases. There is no benefit to cloning since the cloned
+finally would be unreachable. We can detect a subset of such cases
+because there will be no call finally blocks.
+
+JIT64 does not clone finallys that contained switch. We propose to
+do likewise. (Initially I did not include this restriction but
+hit a failing test case where the finally contained a switch. Might be
+worth a deeper look, though such cases are presumably rare.)
+
+If the finally never exits normally, then we presume it is cold code,
+and so will not clone.
+
+If the finally is marked as run rarely, we will not clone.
+
+Implementation Proposal
+-----------------------
+
+We propose that empty finally removal and finally cloning be run back
+to back, spliced into the phase list just after fgInline and
+fgAddInternal, and just before implicit by-ref and struct
+promotion. We want to run these early before a lot of structural
+invariants regarding EH are put in place, and before most
+other optimization, but run them after inlining
+(so empty finallys can be more readily identified) and after the
+addition of implicit try-finallys created by the jit. Empty finallys
+may arise later because of optimization, but this seems relatively
+uncommon.
+
+We will remove empty finallys first, then clone.
+
+Neither optimization will run when the jit is generating debuggable
+code or operating in min opts mode.
+
+### Empty Finally Removal (Sketch)
+
+Skip over methods that have no EH, are compiled with min opts, or
+where the jit is generating debuggable code.
+
+Walk the handler table, looking for try-finally (we could also look
+for and remove try-faults with empty faults, but those are presumably
+rare).
+
+If the finally is a single block and contains only a `retfilter`
+statement, then:
+
+* Retarget the callfinally(s) to jump always to the continuation blocks.
+* Remove the paired jump always block(s) (note we expect all finally
+calls to be paired since the empty finally returns).
+* For funclet EH models with finally target bits, clear the finally
+target from the continuations.
+* For non-funclet EH models only, clear out the GT_END_LFIN statement
+in the finally continuations.
+* Remove the handler block.
+* Reparent all directly contained try blocks to the enclosing try region
+or to the method region if there is no enclosing try.
+* Remove the try-finally from the EH table via `fgRemoveEHTableEntry`.
+
+After the walk, if any empty finallys were removed, revalidate the
+integrity of the handler table.
+
+### Finally Cloning (Sketch)
+
+Skip over all methods, if the runtime suports thread abort. More on
+this below.
+
+Skip over methods that have no EH, are compiled with min opts, or
+where the jit is generating debuggable code.
+
+Walk the handler table, looking for try-finally. If the finally is
+enclosed in a handler or encloses another handler, skip.
+
+Walk the finally body blocks. If any is BBJ_SWITCH, or if none
+is BBJ_EHFINALLYRET, skip cloning. If all blocks are RunRarely
+skip cloning. If the finally has more that 15 statements, skip
+cloning.
+
+Walk the try region from back to front (from largest to smallest IL
+offset). Find the last block in the try that invokes the finally. That
+will be the path that will invoke the clone.
+
+If the EH model requires callfinally thunks, and there are multiple
+thunks that invoke the finally, and the callfinally thunk along the
+clone path is not the first, move it to the front (this helps avoid
+extra jumps).
+
+Set the insertion point to just after the callfinally in the path (for
+thunk models) or the end of the try (for non-thunk models). Set up a
+block map. Clone the finally body using `fgNewBBinRegion` and
+`fgNewBBafter` to make the first and subsequent blocks, and
+`CloneBlockState` to fill in the block contents. Clear the handler
+region on the cloned blocks. Bail out if cloning fails. Mark the first
+and last cloned blocks with appropriate BBF flags. Patch up inter-clone
+branches and convert the returns into jumps to the continuation.
+
+Walk the callfinallys, retargeting the ones that return to the
+continuation so that they invoke the clone. Remove the paired always
+blocks. Clear the finally target bit and any GT_END_LFIN from the
+continuation.
+
+If all call finallys are converted, modify the region to be try/fault
+(interally EH_HANDLER_FAULT_WAS_FINALLY, so we can distinguish it
+later from "organic" try/faults). Otherwise leave it as a
+try/finally.
+
+Clear the catch type on the clone entry.
+
+### Thread Abort
+
+For runtimes that support thread abort (desktop), more work is
+required:
+
+* The cloned finally must be reported to the runtime. Likely this
+can trigger off of the BBF_CLONED_FINALLY_BEGIN/END flags.
+* The jit must maintain the integrity of the clone by not losing
+track of the blocks involved, and not allowing code to move in our
+out of the cloned region
+
+Code Size Impact
+----------------
+
+Code size impact from finally cloning was measured for CoreCLR on
+Windows x64.
+
+```
+Total bytes of diff: 16158 (0.12 % of base)
+ diff is a regression.
+Total byte diff includes 0 bytes from reconciling methods
+ Base had 0 unique methods, 0 unique bytes
+ Diff had 0 unique methods, 0 unique bytes
+Top file regressions by size (bytes):
+ 3518 : Microsoft.CodeAnalysis.CSharp.dasm (0.16 % of base)
+ 1895 : System.Linq.Expressions.dasm (0.32 % of base)
+ 1626 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.07 % of base)
+ 1428 : System.Threading.Tasks.Parallel.dasm (4.66 % of base)
+ 1248 : System.Linq.Parallel.dasm (0.20 % of base)
+Top file improvements by size (bytes):
+ -4529 : System.Private.CoreLib.dasm (-0.14 % of base)
+ -975 : System.Reflection.Metadata.dasm (-0.28 % of base)
+ -239 : System.Private.Uri.dasm (-0.27 % of base)
+ -104 : System.Runtime.InteropServices.RuntimeInformation.dasm (-3.36 % of base)
+ -99 : System.Security.Cryptography.Encoding.dasm (-0.61 % of base)
+57 total files with size differences.
+Top method regessions by size (bytes):
+ 645 : System.Diagnostics.Process.dasm - System.Diagnostics.Process:StartCore(ref):bool:this
+ 454 : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.ExpressionBinder:AdjustCallArgumentsForParams(ref,ref,ref,ref,ref,byref):this
+ 447 : System.Threading.Tasks.Dataflow.dasm - System.Threading.Tasks.Dataflow.Internal.SpscTargetCore`1[__Canon][System.__Canon]:ProcessMessagesLoopCore():this
+ 421 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Symbols.ImplementsHelper:FindExplicitlyImplementedMember(ref,ref,ref,ref,ref,ref,byref):ref
+ 358 : System.Private.CoreLib.dasm - System.Threading.TimerQueueTimer:Change(int,int):bool:this
+Top method improvements by size (bytes):
+ -2512 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
+ -824 : Microsoft.CodeAnalysis.dasm - Microsoft.Cci.PeWriter:WriteHeaders(ref,ref,ref,ref,byref):this
+ -663 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
+ -627 : System.Private.CoreLib.dasm - System.Diagnostics.Tracing.ManifestBuilder:CreateManifestString():ref:this
+ -546 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long):int:this (67 methods)
+3014 total methods with size differences.
+```
+
+The largest growth is seen in `Process:StartCore`, which has 4
+try-finally constructs.
+
+Diffs generally show improved codegen in the try bodies with cloned
+finallys. However some of this improvement comes from more aggressive
+use of callee save registers, and this causes size inflation in the
+funclets (note finally cloning does not alter the number of
+funclets). So if funclet save/restore could be contained to registers
+used in the funclet, the size impact would be slightly smaller.
+
+There are also some instances where cloning relatively small finallys
+leads to large code size increases. xxx is one example.