diff options
Diffstat (limited to 'Documentation/design-docs/inline-size-estimates.md')
-rw-r--r-- | Documentation/design-docs/inline-size-estimates.md | 526 |
1 files changed, 526 insertions, 0 deletions
diff --git a/Documentation/design-docs/inline-size-estimates.md b/Documentation/design-docs/inline-size-estimates.md new file mode 100644 index 0000000000..12286c7f5d --- /dev/null +++ b/Documentation/design-docs/inline-size-estimates.md @@ -0,0 +1,526 @@ +# Inline Size Estimates + +Note this is work in progress.... + +Inlining is a heuristic-based optimization. There are some cases where +it's obvious that a particular inline is good or bad for performance, +but more typically, things are not that clear. In those cases where +the benfit is in doubt, the compiler will rely on heurstics, typically +based on various estimates of an inline's size impact, speed impact, +or other important factors. + +In this writeup we consider approaches to what should be the simplest +of these estimates: the size impact of an inline. + +## Background + +There are a number of interesting facets to the inline size estimate +problem, but let's start with some generalities. Suppose we have some +size estimate for the caller `C` (ignoring for the moment exactly what +we are measuring), say `CallerSize`, and some size estimate for the +callee `E`, say `CalleeSize`. Now suppose we are contemplating what +would happen if `E` is inlined into `C` to create `C'`. We'd like some +sort of size estimate `CallerSize'`. The simplest estimate is that +`CallerSize'` = `CallerSize + CalleeSize`. + +``` +(1) `CallerSize'` = `CallerSize + CalleeSize` +``` + +However, calling conventions impose some addtional code overhead on +both the caller and callee. The caller must set up arguments in +registers or on the stack, and if there is a return value, might need +to move it or store it somewhere. It also might need to spill values +that are stored in caller-saved registers around the call. + +The callee must set up a stack frame and possibly spill callee-save +registers it intends to use, and then ultimately must undo all this. + +When we inline `E` into `C`, none of this calling convention setup is +necessary. So one can imagine that there is some additional size +savings from inlining, one that depends on the calling convention and +the number and kind of arguments passed from caller to callee. + +``` +(2) CallerSize' = CallerSize + CalleeSize - Overhead +``` + +Note that it's entirely possible that `Overhead > CalleeSize`, so +that `CallerSize' < CallerSize`, that is the inline not only results +in faster code but also in smaller code. Indeed this state of affars +is increasingly common with that advent of modern programming styles +that emphasize building functionality out of lots of small procedures. + +Alternatively, we can compute the size impact as the size change of C +because of the inline of E: + +``` +(2) SizeImpact = CallerSize - CallerSize' +``` + +or with a bit of math, + +``` +(2) SizeImpact = Overhead - CalleeSize +``` + +Now let's look at some actual data to see how well this simple model +for `SizeImpact` holds up. The data below shows the measured size +impact of a single inline into various calling methods in mscorlib. +This data was obtained by first compiling all the methods without +inlining to obtain a baseline size as measured by the encoded +instruction size in bytes. The inliner was then modified so it would +only do a single inline per caller, with the candidate chosen by a +parameter `k` that could be varied externally. mscorlib was then +repeatedly compiled with varying values of `k`, and the inlines +performed and the subsequent modified caller sizes recorded. This data +was then permuted and histogrammed to show the variety of `SizeImpact` +values for a fixed `E`. Since the calling convention and number and +kind of parameters is fixed across all observations for a fixed `E`, +the better model would predict `SizeImpact` is the same in every +case. + +In this first example, case `E` is `System.Object:.ctor()`, which is +close to the simplest possible callee -- it takes one argument and has +no return value. In all the observed cases the `SizeImpact` is +negative, but it's far from the same in every case. + + +``` +Inlining for System.Object:.ctor():this (size 6) +Instances 1160 Mean SizeImpact -12.53 Min -52 Max -5 +Distribution + less than -50: 1 + [-50,-41]: 0 1 1 0 0 1 0 0 1 1 + [-40,-31]: 0 1 0 0 1 0 0 3 1 2 + [-30,-21]: 1 3 2 2 1 8 76 5 22 18 + [-20,-11]: 38 23 98 7 29 4 8 27 14 21 + [-10, -1]: 1 362 373 0 0 3 0 0 0 0 + [ 0, 9]: 0 0 0 0 0 0 0 0 0 0 + [ 10, 19]: 0 0 0 0 0 0 0 0 0 0 + [ 20, 29]: 0 0 0 0 0 0 0 0 0 0 + [ 30, 39]: 0 0 0 0 0 0 0 0 0 0 + [ 40, 49]: 0 0 0 0 0 0 0 0 0 0 +greater than 49: 0 +``` + +It should be evident from this set of observations that `SizeImpact` +cannot be completely characterized by a simple formula like `(2)`. For +this inlinee, inlining always saves at least 5 bytes, roughly half the time +it saves 8 or 9 bytes, on average it saves about 12.5 bytes, but it often +saves considerably more. + +Other inlinees show similar spreads in `SizeImpact`. Here we see a case +where sometimes the `SizeImpact` is negative and other times it's +positive. + +``` +Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29) +Instances 42 SizeImpact Mean 11.33 Min -20 Max 28 +Distribution + less than -50: 0 + [-50,-41]: 0 0 0 0 0 0 0 0 0 0 + [-40,-31]: 0 0 0 0 0 0 0 0 0 0 + [-30,-21]: 0 0 0 0 0 0 0 0 0 0 + [-20,-11]: 1 0 0 0 0 0 0 1 0 0 + [-10, -1]: 0 2 0 0 0 0 0 0 1 2 + [ 0, 9]: 0 0 1 1 0 2 0 1 4 1 + [ 10, 19]: 1 0 1 0 4 1 2 1 2 6 + [ 20, 29]: 0 0 2 0 2 0 0 0 3 0 + [ 30, 39]: 0 0 0 0 0 0 0 0 0 0 + [ 40, 49]: 0 0 0 0 0 0 0 0 0 0 +greater than 49: 0 +``` + +Not all inlinee `SizeImpacts` exhibit such wide distributions. Some spread +just a little: + +``` +Inlining for System.Environment:GetResourceString(ref):ref (size 15) +Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6 +Distribution + less than -50: 0 + [-50,-41]: 0 0 0 0 0 0 0 0 0 0 + [-40,-31]: 0 0 0 0 0 0 0 0 0 0 + [-30,-21]: 0 0 0 0 0 0 0 0 0 0 + [-20,-11]: 0 0 0 0 0 0 0 0 0 0 + [-10, -1]: 0 0 0 0 0 0 0 7 0 7 + [ 0, 9]:2212 3 0 5 0 0 4 0 0 0 + [ 10, 19]: 0 0 0 0 0 0 0 0 0 0 + [ 20, 29]: 0 0 0 0 0 0 0 0 0 0 + [ 30, 39]: 0 0 0 0 0 0 0 0 0 0 + [ 40, 49]: 0 0 0 0 0 0 0 0 0 0 +greater than 49: 0 +``` + +and some not at all: + +``` +Inlining for System.DateTime:get_Ticks():long:this (size 15) +Instances 129 SizeImpact Mean 0.00 Min 0 Max 0 +Distribution + less than -50: 0 + [-50,-41]: 0 0 0 0 0 0 0 0 0 0 + [-40,-31]: 0 0 0 0 0 0 0 0 0 0 + [-30,-21]: 0 0 0 0 0 0 0 0 0 0 + [-20,-11]: 0 0 0 0 0 0 0 0 0 0 + [-10, -1]: 0 0 0 0 0 0 0 0 0 0 + [ 0, 9]: 129 0 0 0 0 0 0 0 0 0 + [ 10, 19]: 0 0 0 0 0 0 0 0 0 0 + [ 20, 29]: 0 0 0 0 0 0 0 0 0 0 + [ 30, 39]: 0 0 0 0 0 0 0 0 0 0 + [ 40, 49]: 0 0 0 0 0 0 0 0 0 0 +greater than 49: 0 +``` + +So it appears there must be other -- perhaps many -- factors that +influence the `SizeImpact` of a particular inline. The question is, +what factors are important, how do we measure them, and how do they +they combine to influence the overall result? The remainder of this +document will look into some of the ways we can answer this question. + +## Some Ground Rules + +Before we move on, however, we should address what kind of information +will actually be available to observe and use in building a heuristic +model. + +It's not realistic to feed the heuristic on the actual encoded size of +methods. While compiling a method `C` obtaining the encoded size of +`C` is problematic. It's also quite likely that the encoded size of +the inlinee `E` is unknown, except possibly in some AOT scenarios +where one could generally arrange for methods to be compiled in +bottom-up order (that is, callees before callers). While one could +imagine spending compile time doing experimental compilations of `C`'s +and `E`'s to see what code size results, this is probably too costly +to be practical. + +Second, even if we could obtain the actual size of prospective inline +candidates, we might not want to use this data. The final code +sequence emitted by the compiler depends intimately on details of the +target archtecture, runtime conventions (ABIs), and capabilites of the +compiler phases that run after inlining. If we allow feedback into hey +heuristics by incorporating data from these "downstream" sources, we +introduce various forms of coupling that have important +consequences. For example, it would be challenging to work on a +downstream phase (say, register allocation) if some trivial change to +allocation in an inlinee `E` perturbs inlining decisions for other +methods like `C`. Likewise we may or may not want to allow runtime +conventions to influence inlining -- if our application can run cross +platform (say in both Windows and Linux) we might not want the various +versions to exhibit vastly different inlining patterns. + +For that reason, we intend to restrict the information we can use in +forming heuristics to things that are generally true of the caller and +callee, and the general capabilities of the downstream phases. For the +caller and callee this largely means information that can be obtained +by inspecting either the input to the compiler, or the compiler's IR +in the stages leading up to inlining, and perhaps (if available) some +carefully chosen information obtained from actual compilation of the +callee. + +At the same time, we still intend for the predictions made by the +heuristics to be indicative of the final binary size. We have no other +reliable source of truth to guide us. + +Given all this, we can restate the central question as: how best can +the compiler estimate the ultimate size impact of an inline while +restricting itself to considering features that are generally true of +caller, callee, and the capabilities of the compiler? + +## Building a Heuristic Manually + +The tried-and-true approach is to build the heuristic manually. There +are two prongs to the approach: the first is case analysis of actual +behavior, and the second is modelling based on the compiler writer's +experience and intution. + +### Some Case Analysis + +#### Case 1: Maximal Savings + +It's always instructive to look at actual behavior, so let's consider +some of the cases. From the table above for `System.Object:.ctor()` we +see a number of cases where there were large savings in byte size. The +most significant of these is the following: + +``` +Inline System.Object:.ctor():this +into System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this +CalleeSize = 6 +CallerSize = 72 +CallerSize' = 24 +SizeImpact = -48 +``` + +where `System.Object:.ctor():this` is simply: + +```Assembly +; System.Object:.ctor():this +; Total bytes of code 6, prolog size 5 + 0F1F440000 nop + C3 ret +``` + +As an aside, one might wonder why there are 5 bytes of nops here -- +they are put there to support the *rejit* feature used by the +application insights framework. + +The caller's source code shows it simply delegates immediately to +another method to initialize the object: + +```C# + UnmanagedMemoryAccessor( + SafeBuffer buffer, Int64 offset, + Int64 capacity, FileAccess access) { + Initialize(buffer, offset, capacity, access); + } +``` + +Here's the code before the inline: + +```Assembly +; BEFORE +; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this +; Total bytes of code 72, prolog size 10 + 4156 push r14 + 57 push rdi + 56 push rsi + 55 push rbp + 53 push rbx + 4883EC20 sub rsp, 32 + 488BF1 mov rsi, rcx + 488BFA mov rdi, rdx + 498BD8 mov rbx, r8 + 498BE9 mov rbp, r9 + 488BCE mov rcx, rsi + E800000000 call System.Object:.ctor():this + 448B742470 mov r14d, dword ptr [rsp+70H] + 4489742470 mov dword ptr [rsp+70H], r14d + 488BCE mov rcx, rsi + 488BD7 mov rdx, rdi + 4C8BC3 mov r8, rbx + 4C8BCD mov r9, rbp + 488D0500000000 lea rax, [(reloc)] + 4883C420 add rsp, 32 + 5B pop rbx + 5D pop rbp + 5E pop rsi + 5F pop rdi + 415E pop r14 + 48FFE0 rex.jmp rax +``` + +and here's the code after the inline: + +```Assembly +; AFTER +; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this +; Total bytes of code 24, prolog size 5 + 0F1F440000 nop + 90 nop + 8B442428 mov eax, dword ptr [rsp+28H] + 89442428 mov dword ptr [rsp+28H], eax + 488D0500000000 lea rax, [(reloc)] + 48FFE0 rex.jmp rax +``` + +In both cases this method ends up tail calling to `Initialize` (via the +`rex.jmp`), passing it the exact set of arguments that +was passed to the method. + +In the before case, the un-inlined call changes the method from a leaf +to a non-leaf. Leaf functions don't need to set up their own frame, +while non-leaves do. So some of the extra code in the prolog and +epilog is the frame setup. Beyond that, however, the un-inlined callee +forces the compiler to move the arguments that will be ultimately be +passed to the tail-called method out of the volatile registers and +into preserved registers before the call, then back again to again to +the proper argument registers after the call. And in order to use the +callee saves these registers must be pushed in the prolog and popped +in the epilog. + +In the after case, no frame setup is required, no registers need to be +preserved across the call, so no registers need saving and +restoring. The two `mov`s that remain appear to be gratuitous. The +first `nop' is there for rejit padding. Not sure why the second one is +there. + +So we can now see why this particular case has such extreme +`SizeImpact`: the un-inlined call triggers frame creation and a fair +amount of shuffling to accomodate the potential side effects of the +call. + +#### Case 2: Typical Savings + +In the data above, the inline of `System.Object:.ctor()` typically +saves either 8 or 9 bytes of code. Let's look at one such case. + +``` +Inline System.Object:.ctor():this +into System.Security.SecurityElement:.ctor():this +CalleeSize = 6 +CallerSize = 15 +CallerSize' = 6 +SizeImpact = -9 +``` + +Here are before and after listings for the caller: + +```Assembly +; BEFORE +; System.Security.SecurityElement:.ctor():this +; Total bytes of code 15, prolog size 5 + 0F1F440000 nop + 488D0500000000 lea rax, [(reloc)] + 48FFE0 rex.jmp rax +``` + +```Assembly +; AFTER +; System.Security.SecurityElement:.ctor():this +; Total bytes of code 6, prolog size 5 + 0F1F440000 nop + C3 ret +``` + +In this case the call site was initially handled via a tail call that +required a 7 byte `lea` and a 3 byte `rex.jmp`. With the inline all +that's left is a single byte `ret`. This case covers 154 of the 187 +instances where inlining `System.Object:.ctor()` saved 9 bytes. + +### Cases 3, 4: Size May Decrease or Increase + +Now let's look at a set of examples where an inline can either +decrease or increase the caller size. Here's the good case: + +``` +Inline System.IntPtr:op_Explicit(long):long +into System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long) +CalleeSize = 35 +CallerSize = 50 +CallerSize' = 22 +SizeImpact = -28 +``` + +Here's the callee code: + +SIGH -- turns out the above has the wrong callee size -- there are +several overloaded versions and my script picks up the wrong one. +true callee is only 9 bytes long: + +```Assembly +; System.IntPtr:op_Explicit(long):long +; Total bytes of code 9, prolog size 5 + 0F1F440000 nop + 488BC1 mov rax, rcx + C3 ret +``` + +At any rate the `SizeImpact` is still correct. + +Here's the before caller code: + +```Assembly +; BEFORE +; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long) +; Total bytes of code 50, prolog size 12 + 55 push rbp + 57 push rdi + 56 push rsi + 4883EC20 sub rsp, 32 + 488D6C2430 lea rbp, [rsp+30H] + 488BF1 mov rsi, rcx + 8BFA mov edi, edx + 498BC8 mov rcx, r8 + E800000000 call System.IntPtr:op_Explicit(long):long + 4C8BC0 mov r8, rax + 8BD7 mov edx, edi + 488BCE mov rcx, rsi + 488D0500000000 lea rax, [(reloc)] + 488D65F0 lea rsp, [rbp-10H] + 5E pop rsi + 5F pop rdi + 5D pop rbp + 48FFE0 rex.jmp rax // tail call WriteInt64 +``` + +and the after caller code: + +```Assembly +; AFTER +; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long) +; Total bytes of code 22, prolog size 10 + 55 push rbp + 4883EC20 sub rsp, 32 + 488D6C2420 lea rbp, [rsp+20H] + E800000000 call System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long) + 90 nop + 488D6500 lea rsp, [rbp] + 5D pop rbp + C3 ret +``` + +Somewhat confusingly, the inline has stopped the jit from making a +tail call, and the size savings should be even greater than they +are. There appears to be some kind of code generation issue within the +jit. The IL for the callee is + +``` +IL_0000 0f 00 ldarga.s 0x0 +IL_0002 7b 8f 02 00 04 ldfld 0x400028F +IL_0007 6e conv.u8 +IL_0008 2a ret + +``` +and the JIT rejects the tail call because + +``` +Rejecting tail call late for call [000008]: Local address taken +``` + +so presumably the `ldarga.s` in the inlinee is causing trouble. + +Now let's see if that same inlinee can cause a size increase. Here's a +case: + +(NOTE this is calling a different overload..., need to get this +straghtened out) + +``` +Inline System.IntPtr:op_Explicit(long):long [35] +into EventData:get_DataPointer():long:this [18] size 35 delta 17 +CalleeSize = 35 +CallerSize = 18 +CallerSize' = 35 +SizeImpact = 17 +``` + +```Assembly +; BEFORE +; EventData:get_DataPointer():long:this +; Total bytes of code 18, prolog size + 0F1F440000 nop + 488B09 mov rcx, qword ptr [rcx] + 488D0500000000 lea rax, [(reloc)] + 48FFE0 rex.jmp rax +``` + +```Assembly +; AFTER +; EventData:get_DataPointer():long:this +; Total bytes of code 35, prolog size 5 + 4883EC28 sub rsp, 40 + 488B11 mov rdx, qword ptr [rcx] + 33C9 xor rcx, rcx + 48894C2420 mov qword ptr [rsp+20H], rcx + 488D4C2420 lea rcx, bword ptr [rsp+20H] + E800000000 call System.IntPtr:.ctor(long):this + 488B442420 mov rax, qword ptr [rsp+20H] + 4883C428 add rsp, 40 + C3 ret +``` + +Here the un-inlined case made a tail call. The inlined case was unable +to do the same optimization for some reason, though it looks like it +should have been able to do so as well. |