summaryrefslogtreecommitdiff
path: root/Documentation/design-docs/inline-size-estimates.md
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/design-docs/inline-size-estimates.md')
-rw-r--r--Documentation/design-docs/inline-size-estimates.md526
1 files changed, 526 insertions, 0 deletions
diff --git a/Documentation/design-docs/inline-size-estimates.md b/Documentation/design-docs/inline-size-estimates.md
new file mode 100644
index 0000000000..12286c7f5d
--- /dev/null
+++ b/Documentation/design-docs/inline-size-estimates.md
@@ -0,0 +1,526 @@
+# Inline Size Estimates
+
+Note this is work in progress....
+
+Inlining is a heuristic-based optimization. There are some cases where
+it's obvious that a particular inline is good or bad for performance,
+but more typically, things are not that clear. In those cases where
+the benfit is in doubt, the compiler will rely on heurstics, typically
+based on various estimates of an inline's size impact, speed impact,
+or other important factors.
+
+In this writeup we consider approaches to what should be the simplest
+of these estimates: the size impact of an inline.
+
+## Background
+
+There are a number of interesting facets to the inline size estimate
+problem, but let's start with some generalities. Suppose we have some
+size estimate for the caller `C` (ignoring for the moment exactly what
+we are measuring), say `CallerSize`, and some size estimate for the
+callee `E`, say `CalleeSize`. Now suppose we are contemplating what
+would happen if `E` is inlined into `C` to create `C'`. We'd like some
+sort of size estimate `CallerSize'`. The simplest estimate is that
+`CallerSize'` = `CallerSize + CalleeSize`.
+
+```
+(1) `CallerSize'` = `CallerSize + CalleeSize`
+```
+
+However, calling conventions impose some addtional code overhead on
+both the caller and callee. The caller must set up arguments in
+registers or on the stack, and if there is a return value, might need
+to move it or store it somewhere. It also might need to spill values
+that are stored in caller-saved registers around the call.
+
+The callee must set up a stack frame and possibly spill callee-save
+registers it intends to use, and then ultimately must undo all this.
+
+When we inline `E` into `C`, none of this calling convention setup is
+necessary. So one can imagine that there is some additional size
+savings from inlining, one that depends on the calling convention and
+the number and kind of arguments passed from caller to callee.
+
+```
+(2) CallerSize' = CallerSize + CalleeSize - Overhead
+```
+
+Note that it's entirely possible that `Overhead > CalleeSize`, so
+that `CallerSize' < CallerSize`, that is the inline not only results
+in faster code but also in smaller code. Indeed this state of affars
+is increasingly common with that advent of modern programming styles
+that emphasize building functionality out of lots of small procedures.
+
+Alternatively, we can compute the size impact as the size change of C
+because of the inline of E:
+
+```
+(2) SizeImpact = CallerSize - CallerSize'
+```
+
+or with a bit of math,
+
+```
+(2) SizeImpact = Overhead - CalleeSize
+```
+
+Now let's look at some actual data to see how well this simple model
+for `SizeImpact` holds up. The data below shows the measured size
+impact of a single inline into various calling methods in mscorlib.
+This data was obtained by first compiling all the methods without
+inlining to obtain a baseline size as measured by the encoded
+instruction size in bytes. The inliner was then modified so it would
+only do a single inline per caller, with the candidate chosen by a
+parameter `k` that could be varied externally. mscorlib was then
+repeatedly compiled with varying values of `k`, and the inlines
+performed and the subsequent modified caller sizes recorded. This data
+was then permuted and histogrammed to show the variety of `SizeImpact`
+values for a fixed `E`. Since the calling convention and number and
+kind of parameters is fixed across all observations for a fixed `E`,
+the better model would predict `SizeImpact` is the same in every
+case.
+
+In this first example, case `E` is `System.Object:.ctor()`, which is
+close to the simplest possible callee -- it takes one argument and has
+no return value. In all the observed cases the `SizeImpact` is
+negative, but it's far from the same in every case.
+
+
+```
+Inlining for System.Object:.ctor():this (size 6)
+Instances 1160 Mean SizeImpact -12.53 Min -52 Max -5
+Distribution
+ less than -50: 1
+ [-50,-41]: 0 1 1 0 0 1 0 0 1 1
+ [-40,-31]: 0 1 0 0 1 0 0 3 1 2
+ [-30,-21]: 1 3 2 2 1 8 76 5 22 18
+ [-20,-11]: 38 23 98 7 29 4 8 27 14 21
+ [-10, -1]: 1 362 373 0 0 3 0 0 0 0
+ [ 0, 9]: 0 0 0 0 0 0 0 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+It should be evident from this set of observations that `SizeImpact`
+cannot be completely characterized by a simple formula like `(2)`. For
+this inlinee, inlining always saves at least 5 bytes, roughly half the time
+it saves 8 or 9 bytes, on average it saves about 12.5 bytes, but it often
+saves considerably more.
+
+Other inlinees show similar spreads in `SizeImpact`. Here we see a case
+where sometimes the `SizeImpact` is negative and other times it's
+positive.
+
+```
+Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29)
+Instances 42 SizeImpact Mean 11.33 Min -20 Max 28
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 1 0 0 0 0 0 0 1 0 0
+ [-10, -1]: 0 2 0 0 0 0 0 0 1 2
+ [ 0, 9]: 0 0 1 1 0 2 0 1 4 1
+ [ 10, 19]: 1 0 1 0 4 1 2 1 2 6
+ [ 20, 29]: 0 0 2 0 2 0 0 0 3 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+Not all inlinee `SizeImpacts` exhibit such wide distributions. Some spread
+just a little:
+
+```
+Inlining for System.Environment:GetResourceString(ref):ref (size 15)
+Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 0 0 0 0 0 0 0 0 0 0
+ [-10, -1]: 0 0 0 0 0 0 0 7 0 7
+ [ 0, 9]:2212 3 0 5 0 0 4 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+and some not at all:
+
+```
+Inlining for System.DateTime:get_Ticks():long:this (size 15)
+Instances 129 SizeImpact Mean 0.00 Min 0 Max 0
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 0 0 0 0 0 0 0 0 0 0
+ [-10, -1]: 0 0 0 0 0 0 0 0 0 0
+ [ 0, 9]: 129 0 0 0 0 0 0 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+So it appears there must be other -- perhaps many -- factors that
+influence the `SizeImpact` of a particular inline. The question is,
+what factors are important, how do we measure them, and how do they
+they combine to influence the overall result? The remainder of this
+document will look into some of the ways we can answer this question.
+
+## Some Ground Rules
+
+Before we move on, however, we should address what kind of information
+will actually be available to observe and use in building a heuristic
+model.
+
+It's not realistic to feed the heuristic on the actual encoded size of
+methods. While compiling a method `C` obtaining the encoded size of
+`C` is problematic. It's also quite likely that the encoded size of
+the inlinee `E` is unknown, except possibly in some AOT scenarios
+where one could generally arrange for methods to be compiled in
+bottom-up order (that is, callees before callers). While one could
+imagine spending compile time doing experimental compilations of `C`'s
+and `E`'s to see what code size results, this is probably too costly
+to be practical.
+
+Second, even if we could obtain the actual size of prospective inline
+candidates, we might not want to use this data. The final code
+sequence emitted by the compiler depends intimately on details of the
+target archtecture, runtime conventions (ABIs), and capabilites of the
+compiler phases that run after inlining. If we allow feedback into hey
+heuristics by incorporating data from these "downstream" sources, we
+introduce various forms of coupling that have important
+consequences. For example, it would be challenging to work on a
+downstream phase (say, register allocation) if some trivial change to
+allocation in an inlinee `E` perturbs inlining decisions for other
+methods like `C`. Likewise we may or may not want to allow runtime
+conventions to influence inlining -- if our application can run cross
+platform (say in both Windows and Linux) we might not want the various
+versions to exhibit vastly different inlining patterns.
+
+For that reason, we intend to restrict the information we can use in
+forming heuristics to things that are generally true of the caller and
+callee, and the general capabilities of the downstream phases. For the
+caller and callee this largely means information that can be obtained
+by inspecting either the input to the compiler, or the compiler's IR
+in the stages leading up to inlining, and perhaps (if available) some
+carefully chosen information obtained from actual compilation of the
+callee.
+
+At the same time, we still intend for the predictions made by the
+heuristics to be indicative of the final binary size. We have no other
+reliable source of truth to guide us.
+
+Given all this, we can restate the central question as: how best can
+the compiler estimate the ultimate size impact of an inline while
+restricting itself to considering features that are generally true of
+caller, callee, and the capabilities of the compiler?
+
+## Building a Heuristic Manually
+
+The tried-and-true approach is to build the heuristic manually. There
+are two prongs to the approach: the first is case analysis of actual
+behavior, and the second is modelling based on the compiler writer's
+experience and intution.
+
+### Some Case Analysis
+
+#### Case 1: Maximal Savings
+
+It's always instructive to look at actual behavior, so let's consider
+some of the cases. From the table above for `System.Object:.ctor()` we
+see a number of cases where there were large savings in byte size. The
+most significant of these is the following:
+
+```
+Inline System.Object:.ctor():this
+into System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+CalleeSize = 6
+CallerSize = 72
+CallerSize' = 24
+SizeImpact = -48
+```
+
+where `System.Object:.ctor():this` is simply:
+
+```Assembly
+; System.Object:.ctor():this
+; Total bytes of code 6, prolog size 5
+ 0F1F440000 nop
+ C3 ret
+```
+
+As an aside, one might wonder why there are 5 bytes of nops here --
+they are put there to support the *rejit* feature used by the
+application insights framework.
+
+The caller's source code shows it simply delegates immediately to
+another method to initialize the object:
+
+```C#
+ UnmanagedMemoryAccessor(
+ SafeBuffer buffer, Int64 offset,
+ Int64 capacity, FileAccess access) {
+ Initialize(buffer, offset, capacity, access);
+ }
+```
+
+Here's the code before the inline:
+
+```Assembly
+; BEFORE
+; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+; Total bytes of code 72, prolog size 10
+ 4156 push r14
+ 57 push rdi
+ 56 push rsi
+ 55 push rbp
+ 53 push rbx
+ 4883EC20 sub rsp, 32
+ 488BF1 mov rsi, rcx
+ 488BFA mov rdi, rdx
+ 498BD8 mov rbx, r8
+ 498BE9 mov rbp, r9
+ 488BCE mov rcx, rsi
+ E800000000 call System.Object:.ctor():this
+ 448B742470 mov r14d, dword ptr [rsp+70H]
+ 4489742470 mov dword ptr [rsp+70H], r14d
+ 488BCE mov rcx, rsi
+ 488BD7 mov rdx, rdi
+ 4C8BC3 mov r8, rbx
+ 4C8BCD mov r9, rbp
+ 488D0500000000 lea rax, [(reloc)]
+ 4883C420 add rsp, 32
+ 5B pop rbx
+ 5D pop rbp
+ 5E pop rsi
+ 5F pop rdi
+ 415E pop r14
+ 48FFE0 rex.jmp rax
+```
+
+and here's the code after the inline:
+
+```Assembly
+; AFTER
+; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+; Total bytes of code 24, prolog size 5
+ 0F1F440000 nop
+ 90 nop
+ 8B442428 mov eax, dword ptr [rsp+28H]
+ 89442428 mov dword ptr [rsp+28H], eax
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+In both cases this method ends up tail calling to `Initialize` (via the
+`rex.jmp`), passing it the exact set of arguments that
+was passed to the method.
+
+In the before case, the un-inlined call changes the method from a leaf
+to a non-leaf. Leaf functions don't need to set up their own frame,
+while non-leaves do. So some of the extra code in the prolog and
+epilog is the frame setup. Beyond that, however, the un-inlined callee
+forces the compiler to move the arguments that will be ultimately be
+passed to the tail-called method out of the volatile registers and
+into preserved registers before the call, then back again to again to
+the proper argument registers after the call. And in order to use the
+callee saves these registers must be pushed in the prolog and popped
+in the epilog.
+
+In the after case, no frame setup is required, no registers need to be
+preserved across the call, so no registers need saving and
+restoring. The two `mov`s that remain appear to be gratuitous. The
+first `nop' is there for rejit padding. Not sure why the second one is
+there.
+
+So we can now see why this particular case has such extreme
+`SizeImpact`: the un-inlined call triggers frame creation and a fair
+amount of shuffling to accomodate the potential side effects of the
+call.
+
+#### Case 2: Typical Savings
+
+In the data above, the inline of `System.Object:.ctor()` typically
+saves either 8 or 9 bytes of code. Let's look at one such case.
+
+```
+Inline System.Object:.ctor():this
+into System.Security.SecurityElement:.ctor():this
+CalleeSize = 6
+CallerSize = 15
+CallerSize' = 6
+SizeImpact = -9
+```
+
+Here are before and after listings for the caller:
+
+```Assembly
+; BEFORE
+; System.Security.SecurityElement:.ctor():this
+; Total bytes of code 15, prolog size 5
+ 0F1F440000 nop
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+```Assembly
+; AFTER
+; System.Security.SecurityElement:.ctor():this
+; Total bytes of code 6, prolog size 5
+ 0F1F440000 nop
+ C3 ret
+```
+
+In this case the call site was initially handled via a tail call that
+required a 7 byte `lea` and a 3 byte `rex.jmp`. With the inline all
+that's left is a single byte `ret`. This case covers 154 of the 187
+instances where inlining `System.Object:.ctor()` saved 9 bytes.
+
+### Cases 3, 4: Size May Decrease or Increase
+
+Now let's look at a set of examples where an inline can either
+decrease or increase the caller size. Here's the good case:
+
+```
+Inline System.IntPtr:op_Explicit(long):long
+into System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+CalleeSize = 35
+CallerSize = 50
+CallerSize' = 22
+SizeImpact = -28
+```
+
+Here's the callee code:
+
+SIGH -- turns out the above has the wrong callee size -- there are
+several overloaded versions and my script picks up the wrong one.
+true callee is only 9 bytes long:
+
+```Assembly
+; System.IntPtr:op_Explicit(long):long
+; Total bytes of code 9, prolog size 5
+ 0F1F440000 nop
+ 488BC1 mov rax, rcx
+ C3 ret
+```
+
+At any rate the `SizeImpact` is still correct.
+
+Here's the before caller code:
+
+```Assembly
+; BEFORE
+; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+; Total bytes of code 50, prolog size 12
+ 55 push rbp
+ 57 push rdi
+ 56 push rsi
+ 4883EC20 sub rsp, 32
+ 488D6C2430 lea rbp, [rsp+30H]
+ 488BF1 mov rsi, rcx
+ 8BFA mov edi, edx
+ 498BC8 mov rcx, r8
+ E800000000 call System.IntPtr:op_Explicit(long):long
+ 4C8BC0 mov r8, rax
+ 8BD7 mov edx, edi
+ 488BCE mov rcx, rsi
+ 488D0500000000 lea rax, [(reloc)]
+ 488D65F0 lea rsp, [rbp-10H]
+ 5E pop rsi
+ 5F pop rdi
+ 5D pop rbp
+ 48FFE0 rex.jmp rax // tail call WriteInt64
+```
+
+and the after caller code:
+
+```Assembly
+; AFTER
+; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+; Total bytes of code 22, prolog size 10
+ 55 push rbp
+ 4883EC20 sub rsp, 32
+ 488D6C2420 lea rbp, [rsp+20H]
+ E800000000 call System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long)
+ 90 nop
+ 488D6500 lea rsp, [rbp]
+ 5D pop rbp
+ C3 ret
+```
+
+Somewhat confusingly, the inline has stopped the jit from making a
+tail call, and the size savings should be even greater than they
+are. There appears to be some kind of code generation issue within the
+jit. The IL for the callee is
+
+```
+IL_0000 0f 00 ldarga.s 0x0
+IL_0002 7b 8f 02 00 04 ldfld 0x400028F
+IL_0007 6e conv.u8
+IL_0008 2a ret
+
+```
+and the JIT rejects the tail call because
+
+```
+Rejecting tail call late for call [000008]: Local address taken
+```
+
+so presumably the `ldarga.s` in the inlinee is causing trouble.
+
+Now let's see if that same inlinee can cause a size increase. Here's a
+case:
+
+(NOTE this is calling a different overload..., need to get this
+straghtened out)
+
+```
+Inline System.IntPtr:op_Explicit(long):long [35]
+into EventData:get_DataPointer():long:this [18] size 35 delta 17
+CalleeSize = 35
+CallerSize = 18
+CallerSize' = 35
+SizeImpact = 17
+```
+
+```Assembly
+; BEFORE
+; EventData:get_DataPointer():long:this
+; Total bytes of code 18, prolog size
+ 0F1F440000 nop
+ 488B09 mov rcx, qword ptr [rcx]
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+```Assembly
+; AFTER
+; EventData:get_DataPointer():long:this
+; Total bytes of code 35, prolog size 5
+ 4883EC28 sub rsp, 40
+ 488B11 mov rdx, qword ptr [rcx]
+ 33C9 xor rcx, rcx
+ 48894C2420 mov qword ptr [rsp+20H], rcx
+ 488D4C2420 lea rcx, bword ptr [rsp+20H]
+ E800000000 call System.IntPtr:.ctor(long):this
+ 488B442420 mov rax, qword ptr [rsp+20H]
+ 4883C428 add rsp, 40
+ C3 ret
+```
+
+Here the un-inlined case made a tail call. The inlined case was unable
+to do the same optimization for some reason, though it looks like it
+should have been able to do so as well.