author    Andy Ayers <andya@microsoft.com>  2015-10-21 16:09:28 -0700
committer Andy Ayers <andya@microsoft.com>  2015-11-02 09:25:46 -0800
commit    cc1e6074bb7954f7e1913fd916a6fac571230c94 (patch)
tree      7053591ee11419cdac3141fde022719e87fc7b1e /Documentation/design-docs
parent    83b2cb83b32faa45b4f790237b5c5e259692294a (diff)
Initial draft for inlining plans
I am pushing these early drafts out to be transparent about the work we anticipate doing for improving inlining in both RyuJit and LLILC. Expect revisions and more details to emerge over time.
Diffstat (limited to 'Documentation/design-docs')
-rw-r--r--  Documentation/design-docs/inline-size-estimates.md  526
-rw-r--r--  Documentation/design-docs/inlining-plans.md         438
2 files changed, 964 insertions, 0 deletions
diff --git a/Documentation/design-docs/inline-size-estimates.md b/Documentation/design-docs/inline-size-estimates.md
new file mode 100644
index 0000000000..12286c7f5d
--- /dev/null
+++ b/Documentation/design-docs/inline-size-estimates.md
@@ -0,0 +1,526 @@
+# Inline Size Estimates
+
+Note this is work in progress....
+
+Inlining is a heuristic-based optimization. There are some cases where
+it's obvious that a particular inline is good or bad for performance,
+but more typically, things are not that clear. In those cases where
+the benefit is in doubt, the compiler will rely on heuristics, typically
+based on various estimates of an inline's size impact, speed impact,
+or other important factors.
+
+In this writeup we consider approaches to what should be the simplest
+of these estimates: the size impact of an inline.
+
+## Background
+
+There are a number of interesting facets to the inline size estimate
+problem, but let's start with some generalities. Suppose we have some
+size estimate for the caller `C` (ignoring for the moment exactly what
+we are measuring), say `CallerSize`, and some size estimate for the
+callee `E`, say `CalleeSize`. Now suppose we are contemplating what
+would happen if `E` is inlined into `C` to create `C'`. We'd like some
+sort of size estimate `CallerSize'`. The simplest estimate is that
+`CallerSize'` = `CallerSize + CalleeSize`.
+
+```
+(1) `CallerSize'` = `CallerSize + CalleeSize`
+```
+
+However, calling conventions impose some additional code overhead on
+both the caller and callee. The caller must set up arguments in
+registers or on the stack, and if there is a return value, might need
+to move it or store it somewhere. It also might need to spill values
+that are stored in caller-saved registers around the call.
+
+The callee must set up a stack frame and possibly spill callee-save
+registers it intends to use, and then ultimately must undo all this.
+
+When we inline `E` into `C`, none of this calling convention setup is
+necessary. So one can imagine that there is some additional size
+savings from inlining, one that depends on the calling convention and
+the number and kind of arguments passed from caller to callee.
+
+```
+(2) CallerSize' = CallerSize + CalleeSize - Overhead
+```
+
+Note that it's entirely possible that `Overhead > CalleeSize`, so
+that `CallerSize' < CallerSize`; that is, the inline not only results
+in faster code but also in smaller code. Indeed this state of affairs
+is increasingly common with the advent of modern programming styles
+that emphasize building functionality out of lots of small procedures.
+
+Alternatively, we can compute the size impact as the size change of `C`
+because of the inline of `E`:
+
+```
+(3) SizeImpact = CallerSize' - CallerSize
+```
+
+or, substituting from `(2)`,
+
+```
+(4) SizeImpact = CalleeSize - Overhead
+```
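A minimal sketch of this additive model, with hypothetical byte counts (the function names and the `overhead` value are illustrative, not actual jit code):

```python
def caller_size_after_inline(caller_size, callee_size, overhead):
    # Equation (2): inlining adds the callee body to the caller but
    # removes the call-site and calling-convention overhead.
    return caller_size + callee_size - overhead

def size_impact(callee_size, overhead):
    # The resulting size impact: callee size minus the overhead the
    # inline eliminates. Negative values mean the caller shrank.
    return callee_size - overhead

# A hypothetical 72-byte caller inlining a 6-byte callee whose call
# overhead is 54 bytes shrinks to 24 bytes (SizeImpact = -48).
print(caller_size_after_inline(72, 6, 54))  # 24
print(size_impact(6, 54))                   # -48
```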
+
+Now let's look at some actual data to see how well this simple model
+for `SizeImpact` holds up. The data below shows the measured size
+impact of a single inline into various calling methods in mscorlib.
+This data was obtained by first compiling all the methods without
+inlining to obtain a baseline size as measured by the encoded
+instruction size in bytes. The inliner was then modified so it would
+only do a single inline per caller, with the candidate chosen by a
+parameter `k` that could be varied externally. mscorlib was then
+repeatedly compiled with varying values of `k`, and the inlines
+performed and the subsequent modified caller sizes recorded. This data
+was then permuted and histogrammed to show the variety of `SizeImpact`
+values for a fixed `E`. Since the calling convention and the number and
+kind of parameters are fixed across all observations for a fixed `E`,
+the simple model above would predict that `SizeImpact` is the same in
+every case.
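The histogramming step can be sketched as follows (an illustrative reconstruction in Python, not the actual measurement script; the bucket layout mirrors the tables below):

```python
from collections import Counter

def histogram(size_impacts):
    # Bucket observed SizeImpact values the way the tables below do:
    # one counter per byte value in [-50, 49], plus overflow buckets
    # on either end.
    counts = Counter(size_impacts)
    lines = ["less than -50: %d" % sum(v for k, v in counts.items() if k < -50)]
    for lo in range(-50, 50, 10):
        row = [counts.get(lo + i, 0) for i in range(10)]
        lines.append("[%3d,%3d]: %s" % (lo, lo + 9, " ".join(str(c) for c in row)))
    lines.append("greater than 49: %d" % sum(v for k, v in counts.items() if k > 49))
    return lines

# Example: three inlines that each saved 9 bytes, one that saved 48.
for line in histogram([-9, -9, -9, -48]):
    print(line)
```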
+
+In this first example, case `E` is `System.Object:.ctor()`, which is
+close to the simplest possible callee -- it takes one argument and has
+no return value. In all the observed cases the `SizeImpact` is
+negative, but it's far from the same in every case.
+
+
+```
+Inlining for System.Object:.ctor():this (size 6)
+Instances 1160 Mean SizeImpact -12.53 Min -52 Max -5
+Distribution
+ less than -50: 1
+ [-50,-41]: 0 1 1 0 0 1 0 0 1 1
+ [-40,-31]: 0 1 0 0 1 0 0 3 1 2
+ [-30,-21]: 1 3 2 2 1 8 76 5 22 18
+ [-20,-11]: 38 23 98 7 29 4 8 27 14 21
+ [-10, -1]: 1 362 373 0 0 3 0 0 0 0
+ [ 0, 9]: 0 0 0 0 0 0 0 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+It should be evident from this set of observations that `SizeImpact`
+cannot be completely characterized by a simple formula like `(2)`. For
+this inlinee, inlining always saves at least 5 bytes, roughly half the time
+it saves 8 or 9 bytes, on average it saves about 12.5 bytes, but it often
+saves considerably more.
+
+Other inlinees show similar spreads in `SizeImpact`. Here we see a case
+where sometimes the `SizeImpact` is negative and other times it's
+positive.
+
+```
+Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29)
+Instances 42 SizeImpact Mean 11.33 Min -20 Max 28
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 1 0 0 0 0 0 0 1 0 0
+ [-10, -1]: 0 2 0 0 0 0 0 0 1 2
+ [ 0, 9]: 0 0 1 1 0 2 0 1 4 1
+ [ 10, 19]: 1 0 1 0 4 1 2 1 2 6
+ [ 20, 29]: 0 0 2 0 2 0 0 0 3 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+Not all inlinees exhibit such wide `SizeImpact` distributions. Some
+spread just a little:
+
+```
+Inlining for System.Environment:GetResourceString(ref):ref (size 15)
+Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 0 0 0 0 0 0 0 0 0 0
+ [-10, -1]: 0 0 0 0 0 0 0 7 0 7
+ [ 0, 9]:2212 3 0 5 0 0 4 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+and some not at all:
+
+```
+Inlining for System.DateTime:get_Ticks():long:this (size 15)
+Instances 129 SizeImpact Mean 0.00 Min 0 Max 0
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 0 0 0 0 0 0 0 0 0 0
+ [-10, -1]: 0 0 0 0 0 0 0 0 0 0
+ [ 0, 9]: 129 0 0 0 0 0 0 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+So it appears there must be other -- perhaps many -- factors that
+influence the `SizeImpact` of a particular inline. The question is,
+what factors are important, how do we measure them, and how do they
+combine to influence the overall result? The remainder of this
+document will look into some of the ways we can answer this question.
+
+## Some Ground Rules
+
+Before we move on, however, we should address what kind of information
+will actually be available to observe and use in building a heuristic
+model.
+
+First, it's not realistic to feed the heuristic the actual encoded sizes
+of methods. While compiling a method `C`, obtaining the encoded size of
+`C` itself is problematic. It's also quite likely that the encoded size of
+the inlinee `E` is unknown, except possibly in some AOT scenarios
+where one could generally arrange for methods to be compiled in
+bottom-up order (that is, callees before callers). While one could
+imagine spending compile time doing experimental compilations of `C`'s
+and `E`'s to see what code size results, this is probably too costly
+to be practical.
+
+Second, even if we could obtain the actual size of prospective inline
+candidates, we might not want to use this data. The final code
+sequence emitted by the compiler depends intimately on details of the
+target architecture, runtime conventions (ABIs), and capabilities of the
+compiler phases that run after inlining. If we allow feedback into the
+heuristics by incorporating data from these "downstream" sources, we
+introduce various forms of coupling that have important
+consequences. For example, it would be challenging to work on a
+downstream phase (say, register allocation) if some trivial change to
+allocation in an inlinee `E` perturbs inlining decisions for other
+methods like `C`. Likewise we may or may not want to allow runtime
+conventions to influence inlining -- if our application can run cross
+platform (say in both Windows and Linux) we might not want the various
+versions to exhibit vastly different inlining patterns.
+
+For that reason, we intend to restrict the information we can use in
+forming heuristics to things that are generally true of the caller and
+callee, and the general capabilities of the downstream phases. For the
+caller and callee this largely means information that can be obtained
+by inspecting either the input to the compiler, or the compiler's IR
+in the stages leading up to inlining, and perhaps (if available) some
+carefully chosen information obtained from actual compilation of the
+callee.
+
+At the same time, we still intend for the predictions made by the
+heuristics to be indicative of the final binary size. We have no other
+reliable source of truth to guide us.
+
+Given all this, we can restate the central question as: how best can
+the compiler estimate the ultimate size impact of an inline while
+restricting itself to considering features that are generally true of
+caller, callee, and the capabilities of the compiler?
+
+## Building a Heuristic Manually
+
+The tried-and-true approach is to build the heuristic manually. There
+are two prongs to the approach: the first is case analysis of actual
+behavior, and the second is modelling based on the compiler writer's
+experience and intuition.
+
+### Some Case Analysis
+
+#### Case 1: Maximal Savings
+
+It's always instructive to look at actual behavior, so let's consider
+some of the cases. From the table above for `System.Object:.ctor()` we
+see a number of cases where there were large savings in byte size. The
+most significant of these is the following:
+
+```
+Inline System.Object:.ctor():this
+into System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+CalleeSize = 6
+CallerSize = 72
+CallerSize' = 24
+SizeImpact = -48
+```
+
+where `System.Object:.ctor():this` is simply:
+
+```Assembly
+; System.Object:.ctor():this
+; Total bytes of code 6, prolog size 5
+ 0F1F440000 nop
+ C3 ret
+```
+
+As an aside, one might wonder why there are 5 bytes of nops here --
+they are put there to support the *rejit* feature used by the
+application insights framework.
+
+The caller's source code shows it simply delegates immediately to
+another method to initialize the object:
+
+```C#
+ UnmanagedMemoryAccessor(
+ SafeBuffer buffer, Int64 offset,
+ Int64 capacity, FileAccess access) {
+ Initialize(buffer, offset, capacity, access);
+ }
+```
+
+Here's the code before the inline:
+
+```Assembly
+; BEFORE
+; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+; Total bytes of code 72, prolog size 10
+ 4156 push r14
+ 57 push rdi
+ 56 push rsi
+ 55 push rbp
+ 53 push rbx
+ 4883EC20 sub rsp, 32
+ 488BF1 mov rsi, rcx
+ 488BFA mov rdi, rdx
+ 498BD8 mov rbx, r8
+ 498BE9 mov rbp, r9
+ 488BCE mov rcx, rsi
+ E800000000 call System.Object:.ctor():this
+ 448B742470 mov r14d, dword ptr [rsp+70H]
+ 4489742470 mov dword ptr [rsp+70H], r14d
+ 488BCE mov rcx, rsi
+ 488BD7 mov rdx, rdi
+ 4C8BC3 mov r8, rbx
+ 4C8BCD mov r9, rbp
+ 488D0500000000 lea rax, [(reloc)]
+ 4883C420 add rsp, 32
+ 5B pop rbx
+ 5D pop rbp
+ 5E pop rsi
+ 5F pop rdi
+ 415E pop r14
+ 48FFE0 rex.jmp rax
+```
+
+and here's the code after the inline:
+
+```Assembly
+; AFTER
+; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+; Total bytes of code 24, prolog size 5
+ 0F1F440000 nop
+ 90 nop
+ 8B442428 mov eax, dword ptr [rsp+28H]
+ 89442428 mov dword ptr [rsp+28H], eax
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+In both cases this method ends up tail calling to `Initialize` (via the
+`rex.jmp`), passing it the exact set of arguments that
+was passed to the method.
+
+In the before case, the un-inlined call changes the method from a leaf
+to a non-leaf. Leaf functions don't need to set up their own frame,
+while non-leaves do. So some of the extra code in the prolog and
+epilog is the frame setup. Beyond that, however, the un-inlined callee
+forces the compiler to move the arguments that will ultimately be
+passed to the tail-called method out of the volatile registers and
+into preserved registers before the call, then back again into
+the proper argument registers after the call. And in order to use the
+callee-save registers, they must be pushed in the prolog and popped
+in the epilog.
+
+In the after case, no frame setup is required and no registers need to
+be preserved across the call, so no registers need saving and
+restoring. The two `mov`s that remain appear to be gratuitous. The
+first `nop` is there for rejit padding; it's not clear why the second
+one is there.
+
+So we can now see why this particular case has such extreme
+`SizeImpact`: the un-inlined call triggers frame creation and a fair
+amount of shuffling to accommodate the potential side effects of the
+call.
+
+#### Case 2: Typical Savings
+
+In the data above, the inline of `System.Object:.ctor()` typically
+saves either 8 or 9 bytes of code. Let's look at one such case.
+
+```
+Inline System.Object:.ctor():this
+into System.Security.SecurityElement:.ctor():this
+CalleeSize = 6
+CallerSize = 15
+CallerSize' = 6
+SizeImpact = -9
+```
+
+Here are before and after listings for the caller:
+
+```Assembly
+; BEFORE
+; System.Security.SecurityElement:.ctor():this
+; Total bytes of code 15, prolog size 5
+ 0F1F440000 nop
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+```Assembly
+; AFTER
+; System.Security.SecurityElement:.ctor():this
+; Total bytes of code 6, prolog size 5
+ 0F1F440000 nop
+ C3 ret
+```
+
+In this case the call site was initially handled via a tail call that
+required a 7 byte `lea` and a 3 byte `rex.jmp`. With the inline all
+that's left is a single byte `ret`. This case covers 154 of the 187
+instances where inlining `System.Object:.ctor()` saved 9 bytes.
+
+#### Cases 3, 4: Size May Decrease or Increase
+
+Now let's look at a set of examples where an inline can either
+decrease or increase the caller size. Here's the good case:
+
+```
+Inline System.IntPtr:op_Explicit(long):long
+into System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+CalleeSize = 35
+CallerSize = 50
+CallerSize' = 22
+SizeImpact = -28
+```
+
+Here's the callee code. Note that the data above actually lists the
+wrong callee size -- there are several overloaded versions of this
+method, and the measurement script picked up the wrong one. The true
+callee is only 9 bytes long:
+
+```Assembly
+; System.IntPtr:op_Explicit(long):long
+; Total bytes of code 9, prolog size 5
+ 0F1F440000 nop
+ 488BC1 mov rax, rcx
+ C3 ret
+```
+
+At any rate the `SizeImpact` is still correct.
+
+Here's the before caller code:
+
+```Assembly
+; BEFORE
+; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+; Total bytes of code 50, prolog size 12
+ 55 push rbp
+ 57 push rdi
+ 56 push rsi
+ 4883EC20 sub rsp, 32
+ 488D6C2430 lea rbp, [rsp+30H]
+ 488BF1 mov rsi, rcx
+ 8BFA mov edi, edx
+ 498BC8 mov rcx, r8
+ E800000000 call System.IntPtr:op_Explicit(long):long
+ 4C8BC0 mov r8, rax
+ 8BD7 mov edx, edi
+ 488BCE mov rcx, rsi
+ 488D0500000000 lea rax, [(reloc)]
+ 488D65F0 lea rsp, [rbp-10H]
+ 5E pop rsi
+ 5F pop rdi
+ 5D pop rbp
+ 48FFE0 rex.jmp rax // tail call WriteInt64
+```
+
+and the after caller code:
+
+```Assembly
+; AFTER
+; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+; Total bytes of code 22, prolog size 10
+ 55 push rbp
+ 4883EC20 sub rsp, 32
+ 488D6C2420 lea rbp, [rsp+20H]
+ E800000000 call System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long)
+ 90 nop
+ 488D6500 lea rsp, [rbp]
+ 5D pop rbp
+ C3 ret
+```
+
+Somewhat confusingly, the inline has stopped the jit from making a
+tail call, and the size savings should be even greater than they
+are. There appears to be some kind of code generation issue within the
+jit. The IL for the callee is
+
+```
+IL_0000 0f 00 ldarga.s 0x0
+IL_0002 7b 8f 02 00 04 ldfld 0x400028F
+IL_0007 6e conv.u8
+IL_0008 2a ret
+
+```
+and the JIT rejects the tail call because
+
+```
+Rejecting tail call late for call [000008]: Local address taken
+```
+
+so presumably the `ldarga.s` in the inlinee is causing trouble.
+
+Now let's see if that same inlinee can cause a size increase. Here's a
+case:
+
+(NOTE this is calling a different overload..., need to get this
+straightened out)
+
+```
+Inline System.IntPtr:op_Explicit(long):long [35]
+into EventData:get_DataPointer():long:this [18] size 35 delta 17
+CalleeSize = 35
+CallerSize = 18
+CallerSize' = 35
+SizeImpact = 17
+```
+
+```Assembly
+; BEFORE
+; EventData:get_DataPointer():long:this
+; Total bytes of code 18, prolog size
+ 0F1F440000 nop
+ 488B09 mov rcx, qword ptr [rcx]
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+```Assembly
+; AFTER
+; EventData:get_DataPointer():long:this
+; Total bytes of code 35, prolog size 5
+ 4883EC28 sub rsp, 40
+ 488B11 mov rdx, qword ptr [rcx]
+ 33C9 xor rcx, rcx
+ 48894C2420 mov qword ptr [rsp+20H], rcx
+ 488D4C2420 lea rcx, bword ptr [rsp+20H]
+ E800000000 call System.IntPtr:.ctor(long):this
+ 488B442420 mov rax, qword ptr [rsp+20H]
+ 4883C428 add rsp, 40
+ C3 ret
+```
+
+Here the un-inlined case made a tail call. The inlined case was unable
+to do the same optimization for some reason, though it looks like it
+should have been able to do so as well.
diff --git a/Documentation/design-docs/inlining-plans.md b/Documentation/design-docs/inlining-plans.md
new file mode 100644
index 0000000000..2279c16e37
--- /dev/null
+++ b/Documentation/design-docs/inlining-plans.md
@@ -0,0 +1,438 @@
+# Proposed Plans for Inlining
+
+This document outlines a plan for improving the inlining capabilities
+of RyuJit as well as adding inlining capabilities to LLILC.
+
+## Background
+
+Inlining is a key optimization, directly reducing call overhead and
+indirectly exposing a wider scope for powerful intraprocedural
+optimizations.
+
+From an implementation standpoint, an inliner has the following aspects:
+
+* Machinery: transformations on the compiler's IR that physically
+incorporate the code for the callee into the caller.
+
+* Legality: a set of rules that describe when inlining is legal (that
+is, preserves program semantics). Here, the legality rules are largely
+dictated to us by the CoreCLR runtime.
+
+* Ability: a set of rules that describe whether the machinery is able
+to perform a legal inline and whether the result can be handled by
+downstream phases.
+
+* Profitability: a set of heuristic rules that describe which legal
+and able inlines are desirable.
+
+The general consensus is that the inlining in RyuJit can be improved,
+while carefully maintaining RyuJit's current lean compilation
+overhead. It is anticipated that this work will encompass Machinery,
+Ability, and Profitability.
+
+LLILC does no inlining today. Since we aspire to have LLILC be a
+high-performance .Net code generator, we need to enable inlining in
+LLILC. LLILC can likely leverage much of LLVM's built-in inlining
+Machinery and Ability, but will need new code for Legality and
+Profitability.
+
+We envision various scenarios for .Net Code generation that impact
+inlining: a first-tier JIT compiler, a higher-tier JIT compiler, a
+fast AOT compiler, and an optimizing AOT compiler. Each scenario calls
+for inlining, but the tradeoffs are different. For a given scenario,
+we may choose to deploy RyuJit or LLILC or both depending on the
+scenario requirements and code generator capabilities.
+
+There are also aspects of inlining that are runtime and target machine
+specific. Each runtime and target architecture will require some
+specialized tuning to accommodate the different costs of operations at
+runtime.
+
+Taking this all into account, the proposed scope of the work
+encompasses two different compiler code bases, a number of codegen
+scenarios, and a number of target architectures and runtime
+conventions. The goal is to come up with one conceptual framework that
+covers all these cases and avoids relying too much on time-intensive
+manual investigation and tuning.
+
+We will measure the quality of inlining along three axes: the time
+spent generating code (aka *throughput* -- TP), the
+time spent executing the code (aka *code quality* -- CQ), and the size
+of the generated code (aka *code size* -- CS).
+
+We also desire that the resulting decision-making machinery (in particular
+the Profitability heuristic) is expressed in terms that are sensible to
+compiler writers.
+
+## The Proposal
+
+### Machinery
+
+RyuJit's machinery may or may not require updates depending on how many
+key inlines are rejected because of lack of ability. This is something
+that requires further investigation. We may also decide to pursue
+enhancements here to ensure that in cascaded inline cases the deeper
+callsites are provided up-to-date information.
+
+LLILC should be able to leverage LLVM's existing inline machinery. In
+JIT-like scenarios, LLILC's reader will need some updates to properly
+handle cases where a method is being "read" as an inline candidate
+rather than as a root method to be compiled. For AOT scenarios we
+will want to do something more elaborate where all of the methods
+being compiled are represented as LLVM IR.
+
+### Legality
+
+As mentioned above, the legality constraints are generally imposed by
+the CoreCLR runtime.
+
+These constraints are already captured in the RyuJit source
+code. However, we generally prefer to at least logically perform all
+the legality checks before any of the ability or profitability checks,
+so that we can reason more simply about why a particular inline didn't
+happen. This may not be entirely practical, since some of the
+profitability or ability early outs may be helping TP by keeping
+RyuJit from doing a lot of exploratory work before ultimately
+rejecting an inline candidate.
+
+LLILC lacks legality checking, and this needs to be implemented.
+
+### Ability
+
+RyuJit currently bails out of some legal inlining cases because of
+internal restrictions. For instance, if a callee modifies one of its
+parameters, RyuJit won't inline it because of limitations in how the
+IR is incorporated. We expect to work on reducing or removing key
+ability limiters in RyuJit.
+
+Assuming the LLVM inlining framework is robust, LLILC shouldn't have
+too many ability constraints.
+
+### Profitability
+
+Given the wide range of scenarios, targets, and the two distinct code
+bases, we propose to heavily rely on machine learning techniques for
+building and tuning the Profitability components of the inliners.
+
+The general approach we suggest is based on past experiences we've had
+developing inliners for various compilers and is inspired by the CGO
+paper [Automatic Construction of Inlining Heuristics using Machine
+Learning](http://dl.acm.org/citation.cfm?id=2495914) by Kulkarni,
+Cavazos, Wimmer, and Simon.
+
+Kulkarni et al. treat profitability as an unsupervised learning
+problem, and create a well-performing heuristic black box (neural
+network) using evolutionary programming techniques. They then turn
+around and use this black box as an oracle to label inline instances,
+and from this guide a supervised machine learning algorithm to produce
+a decision tree that expresses the profitability heuristics in terms
+sensible to the compiler writer.
+
+The inputs to the heuristic black box are various facts and estimates
+about an inline candidate, based on observations of the caller, callee,
+and call site (collectively, *features*). These features may be booleans,
+enumerates (categories), integers, or floating-point values. For example:
+
+* (boolean) `CalleeIsLeaf`, `CallerSiteInTry`
+* (enumerate) `CalleeReturnCorType`
+* (int) `CalleeILSize`, `CallSiteProfileWeight`
+* (float) `CalleeLoadStoreFrequency`
+
+The output is a yes/no inlining decision for a particular candidate.
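One way to picture the heuristic black box is as a small neural network over a candidate's feature vector. The sketch below is purely illustrative -- the feature names follow the examples above, but the weights, threshold, and function names are invented, not RyuJit or LLILC code:

```python
import math

# Features observed for an inline candidate (names from the examples above).
FEATURES = ["CalleeIsLeaf", "CallerSiteInTry", "CalleeILSize",
            "CallSiteProfileWeight"]

def decide_inline(features, weights, bias):
    # One-layer "network": weighted sum of feature values pushed
    # through a sigmoid; inline when the activation clears 0.5.
    z = bias + sum(weights[name] * features[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

# Invented weights: favor leaf callees and hot sites, penalize large
# callees and call sites inside try regions.
weights = {"CalleeIsLeaf": 2.0, "CallerSiteInTry": -3.0,
           "CalleeILSize": -0.05, "CallSiteProfileWeight": 0.01}

candidate = {"CalleeIsLeaf": 1, "CallerSiteInTry": 0,
             "CalleeILSize": 20, "CallSiteProfileWeight": 100}
print(decide_inline(candidate, weights, bias=0.0))  # True
```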
+
+The evolutionary process is governed by a fitness metric that expresses the
+appropriate balance of TP, CQ, and CS for the scenario. For instance, in a JIT
+scenario we likely will want to keep TP and CS close to the current
+baseline values. For AOT we may want to allow higher TP and larger CS if we
+can demonstrate CQ improvements.
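A fitness metric of this sort might be sketched as follows; the weights, budgets, and penalty scaling are all invented for illustration:

```python
def fitness(metrics, baseline, weights, budgets):
    # Score a candidate heuristic's measured TP/CQ/CS against a
    # baseline. Relative regressions beyond a per-metric budget draw a
    # heavy penalty; improvements (negative deltas) raise the score.
    score = 0.0
    for metric in ("TP", "CQ", "CS"):
        rel = (metrics[metric] - baseline[metric]) / baseline[metric]
        if rel > budgets[metric]:
            score -= 100.0 * (rel - budgets[metric])  # budget blown
        score -= weights[metric] * rel                # lower is better
    return score

# JIT-like scenario: CQ matters most; TP and CS must stay near baseline.
jit_weights = {"TP": 1.0, "CQ": 5.0, "CS": 1.0}
jit_budgets = {"TP": 0.02, "CQ": 0.00, "CS": 0.02}

baseline  = {"TP": 100.0, "CQ": 100.0, "CS": 100.0}
better_cq = {"TP": 101.0, "CQ": 95.0, "CS": 101.0}
print(fitness(better_cq, baseline, jit_weights, jit_budgets))
```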
+
+## Implications
+
+For this approach to work, we will need to make frequent and robust
+measurements of the key fitness metrics across a range of
+representative benchmarks. So this presumes:
+
+* we can identify and capture these key benchmarks
+* we can develop methodology for doing appropriate measurements
+* we can do these measurements at appropriate scale
+
+### The Benchmarks
+
+Clearly in a machine learning approach the actual observations made
+are crucial. We need a solid, representative benchmark set. But this
+is a general problem with a heuristic based optimization and not an
+inherent limitation of this approach -- one can only measure what one
+can measure.
+
+We plan to capture the results in a standard schema and hope that,
+with this and the regularization of the metric data (see next section),
+interested third parties can supply their own benchmark results.
+
+### The Metrics
+
+CS is typically very easy to measure and the measurements are "noise
+free". So the measurement only needs to be done once.
+
+If TP and CQ are both measured in terms of time (or cycles) then the
+measurements will be inherently noisy. To reduce the variance caused
+by noise some repeated runs may be necessary, and it might also be
+necessary to run on dedicated, matched hardware. This severely limits
+scalability.
+
+We instead propose to measure both TP and CQ in terms of instructions
+retired. While this may not be entirely noise-free, we expect the
+measurements to be low noise and fairly insensitive to other
+computations on the machine. Thus this should give us decent scale
+(for instance we can run scenarios in the cloud on VMs with varying
+underlying hardware, etc).
+
+For TP, instructions retired should be a reasonable proxy for time,
+since we expect the CPI for the compiler proper to be relatively
+static. For CQ, we realize there may not be as strong a correlation
+with time, but typically the inliner operates early in the compilation
+process and is mainly focused on reducing instruction count rather
+than on reducing cycle stalls.
+
+We'll do some experimentation early on to see how well these proxy
+metrics hold up.
+
+### Analysis
+
+To help prioritize work on profitability and ability (and perhaps find
+areas where we might discuss trying to remove legality constraints),
+we need some mechanism to correlate inlining outcomes with success and
+failure reasons.
+
+We propose to develop a textual, machine-readable inline log schema
+that describes a set of inlining decisions. This schema can be
+produced by a compiler as it makes decisions, and contains sufficient
+correlating data so that we can then match it up with runtime
+observations. For example, we might end up with the following sort of
+per-method output (though more likely it will be embedded in some kind of
+xml or json markup):
+
+```
+ Inlining for 9811 Lookup
+ [o] 22825 Lookup@Lattice
+ [o] 22827 ??0?$interior_ptr@PAVCell
+ [x] 22826 Lookup@Lattice@@@Z (reason: SizeLimit)
+ [o] 21728 ?get_CellList
+
+```
+
+where `[o]` is a successful inline, `[x]` a failed inline, and
+indentation shows the inlining tree. For .Net compilation we'll need
+some kind of persistent ID for methods, which may not be all that easy
+to come by.
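Given a textual log in the (hypothetical) format above, replay and analysis tools need to parse it back into a tree. A minimal Python sketch, assuming two-space indentation per inline level:

```python
def parse_inline_log(text):
    # Reconstruct inline decisions from the indented textual log.
    # Returns (succeeded, failed) lists of (depth, method, reason)
    # tuples. Purely illustrative; the real schema would be richer.
    succeeded, failed = [], []
    for line in text.splitlines():
        stripped = line.lstrip()
        if not stripped.startswith(("[o]", "[x]")):
            continue  # header or blank line
        depth = (len(line) - len(stripped)) // 2   # 2 spaces per level
        success = stripped.startswith("[o]")
        body = stripped[3:].strip()
        reason = None
        if "(reason:" in body:
            body, _, rest = body.partition("(reason:")
            reason = rest.rstrip(")").strip()
        entry = (depth, body.strip(), reason)
        (succeeded if success else failed).append(entry)
    return succeeded, failed

log = """ Inlining for 9811 Lookup
   [o] 22825 Lookup@Lattice
     [o] 22827 ??0?$interior_ptr@PAVCell
   [x] 22826 Lookup@Lattice@@@Z (reason: SizeLimit)
   [o] 21728 ?get_CellList
"""
ok, bad = parse_inline_log(log)
print(len(ok), len(bad))  # 3 successful inlines, 1 failed
```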
+
+This inline log can also be read back by the code generator to enable
+*inline replay* and force the inliner to perform a particular pattern
+of inlines. This is useful for things like performance investigations,
+bug isolation, and for trying to have one compiler emulate the
+inlining behavior of another. Since the logs are textual they can also
+be edited by hand if necessary to perform experiments.
+
+### Obtaining Runtime Data
+
+If we can obtain runtime data about the frequency of call
+instructions, we can run this through the log to accumulate failure
+reasons and create a failure reason table like the following
+(categories and data here totally made up):
+
+```
+ FailedReason | Count | [%] | Sites | [%]
+---------------------|----------|-------|--------|------
+SpeculativeNonGround | 79347955 | 37.8% | 417 | 14.1%
+NonGroundShared | 28756634 | 13.7% | 217 | 07.4%
+SizeLimit | 22066033 | 10.5% | 760 | 25.8%
+ExternFunction | 21923402 | 10.5% | 300 | 10.2%
+... etc ...
+```
+
+This can be used to prioritize inlining work. For instance, if
+`SpeculativeNonGround` was an ability issue, and we wanted to improve
+CQ, we'd give it high priority to fix, since it impacts inlining most
+frequently.
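The accumulation itself is straightforward. Here is a sketch (with invented site ids, reasons, and counts) of how failure records plus runtime call counts could produce such a table:

```python
from collections import defaultdict

def failure_table(failed_inlines, call_counts):
    # Accumulate dynamic call counts and site counts per failure
    # reason. 'failed_inlines' maps a call-site id to its failure
    # reason; 'call_counts' maps a call-site id to its measured
    # execution count. Both inputs are hypothetical.
    counts = defaultdict(lambda: [0, 0])   # reason -> [exec count, sites]
    for site, reason in failed_inlines.items():
        counts[reason][0] += call_counts.get(site, 0)
        counts[reason][1] += 1
    # Sort by dynamic count so the most frequent blockers come first.
    return sorted(counts.items(), key=lambda kv: -kv[1][0])

failed = {"s1": "SizeLimit", "s2": "SizeLimit", "s3": "ExternFunction"}
hot = {"s1": 500, "s2": 250, "s3": 9000}
for reason, (execs, sites) in failure_table(failed, hot):
    print("%-16s %6d %3d" % (reason, execs, sites))
```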
+
+### Robustness
+
+Since we'll likely be altering inlining decisions, we may uncover bugs
+or performance blemishes in the compilers.
+
+It may make sense to build up a random inlining methodology (RyuJit
+evidently has something like this) to try and stress test the compiler
+in the face of inlining changes. Log replay can be used with automatic
+bisection tools to try and isolate the particular inline that exposes
+a bug.
+
+Performance issues are harder to spot automatically, and will likely
+require some amount of manual analysis. However, we may be able to study
+inline instances that are not well-predicted by the inlining models -- for
+instance, cases where the CS is much greater than predicted, or CQ is lower --
+and see if the modelling shortfall is an indicator of problems within the
+compiler itself.
+
+## Development Stages
+
+Given the above, here's a rough draft outline for how the
+functionality could be built.
+
+### Preliminaries -- Measurement
+
+1. Make some benchmarks available for measurement.
+
+2. Demonstrate we can reliably measure the metrics of interest: CS,
+CQ, and TP (as bytes, instructions, and instructions) when jitting and
+pre-jitting the benchmarks. Get some idea of the noise levels for CQ
+and TP. Hopefully they are small.
+
+### Preliminaries -- RyuJit
+
+1. Refactor RyuJit code to clearly separate out legality, ability,
+and profitability aspects of the inliner. Hopefully this can be done
+with no CQ changes and no TP impact.
+
+2. Develop the inline log format.
+
+3. Produce inline logs from RyuJit.
+
+4. Develop methodology for runtime correlation.
+
+5. Produce failure reason analysis for the benchmark set. Decide
+which abilities, if any, need improving.
+
+6. Implement inline log replay for RyuJit.
+
+7. Modify logs by hand or via a tool and replay them to assess the
+landscape of TP/CQ/CS tradeoffs from inliner changes.
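
A log-editing tool for step 7 might look like the sketch below. The
JSON-lines record layout is purely hypothetical (defining the real
format is step 2); the point is just that a replayable log makes
individual decisions easy to flip mechanically.

```python
# Hypothetical inline log: one JSON record per call site, keyed by
# caller/callee names and IL offset, plus the decision taken. All
# names here are invented for illustration.
import json

log_text = """\
{"caller": "Lib.Main", "callee": "Lib.Helper", "offset": 12, "inlined": true}
{"caller": "Lib.Main", "callee": "Lib.Slow", "offset": 40, "inlined": false}
"""

def flip_decision(lines, caller, callee):
    """Invert the inline decision for one (caller, callee) site."""
    out = []
    for line in lines.splitlines():
        rec = json.loads(line)
        if rec["caller"] == caller and rec["callee"] == callee:
            rec["inlined"] = not rec["inlined"]
        out.append(json.dumps(rec))
    return "\n".join(out)

modified = flip_decision(log_text, "Lib.Main", "Lib.Slow")
```

Replaying the modified log and re-measuring CS/CQ/TP then gives one
data point in the tradeoff landscape per edited decision.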
+
+### Preliminaries -- LLILC
+
+1. Add code to capture legality constraints when reading in a method.
+
+2. Generalize LLILC's MSIL reader to read in an inlinee in
+the context of the caller.
+
+3. Write a simple LLVM pass (or specialize an existing pass) to
+perform inlining. Initially just implement a very simple profitability
+heuristic. Verify it all works correctly.
+
+4. Implement inline log generation by LLILC.
+
+5. Implement inline log replay by LLILC.
+
+6. Modify logs by hand or via a tool and replay them to assess the
+landscape of TP/CQ/CS tradeoffs from inliner changes.
+
+### Develop the Heuristics (JIT case)
+
+1. Start by trying to produce code size estimates for a given inline
+candidate. Measure actual code size impacts for individual inlining
+decisions.
+
+2. Determine features to surface and produce a labelled data set of
+(feature*, size impact) pairs. Use this to train a machine-learning
+estimator for code size.
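
To make step 2 concrete, here is a toy least-squares fit of a
one-feature size model. The features (IL size plus an unused argument
count) and all the training labels are invented; a real estimator
would be trained on measured size impacts over many features, likely
with a proper ML toolkit.

```python
# Invented training data: (ILSize, argCount) -> measured size impact
# in bytes. In reality these labels come from measuring actual inlines.
samples = [
    ((10, 1), 19),
    ((20, 2), 34),
    ((40, 2), 64),
    ((80, 4), 124),
]

def fit_ilsize_model(data):
    """Least-squares fit of: impact ~ a * ILSize + b (single feature)."""
    n = len(data)
    sx = sum(f[0] for f, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(f[0] * f[0] for f, _ in data)
    sxy = sum(f[0] * y for f, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

a, b = fit_ilsize_model(samples)
estimate = a * 30 + b  # predicted size impact of a 30-byte-IL inlinee
```

Comparing such estimates against held-out measured impacts is what
tells us whether the feature set is adequate.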
+
+3. We probably can't just plug the new size estimate into the existing
+heuristic framework and expect good results (better information in
+does not always mean better decisions out). So instead build up a
+parallel (off by default) heuristic path where we use the new size
+estimator.
+
+4. Surface features for the CQ and TP impacts. Build a pluggable model
+for the black box heuristic in the compiler. Likely this means the
+model code is generated by a tool from some textual description.
+
+5. Run evolutionary experiments to produce an optimized black box
+heuristic. This involves: (a) randomly or otherwise generating an
+initial population of models; (b) running each model on the
+benchmarks, gathering metrics; (c) combining metrics across benchmarks
+to produce fitness; (d) using fitness to create a new set of models
+using various genetic approaches; (e) iterating until satisfied with
+the result.
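
A toy version of the loop in steps (a) through (e), with models
reduced to bare parameter vectors and fitness replaced by a stand-in
function (real fitness would come from running each model-driven
inliner over the benchmarks and combining CS/CQ/TP metrics):

```python
import random

random.seed(0)  # deterministic, for illustration only

def fitness(model):
    # Stand-in for "run the benchmarks and combine the metrics": here,
    # better models are simply closer to an arbitrary target vector.
    target = [0.5, -1.0, 2.0]
    return -sum((m - t) ** 2 for m, t in zip(model, target))

def evolve(pop_size=20, dims=3, generations=40):
    # (a) random initial population of models (parameter vectors)
    pop = [[random.uniform(-3, 3) for _ in range(dims)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # (b)/(c) score every model and keep the fitter half
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]
        # (d) refill the population via crossover plus small mutations
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            children.append([random.choice(genes) + random.gauss(0, 0.1)
                             for genes in zip(p1, p2)])
        pop = survivors + children
    # (e) stop after a fixed budget and report the best model found
    return max(pop, key=fitness)

best = evolve()
```

The expensive part in practice is step (b): each fitness evaluation
means rebuilding and re-measuring the benchmarks, which is why the
whole loop needs to live in an automated CI pipeline.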
+
+All of this setup should be automated and available in the CI system.
+
+6. Use the winning model to produce a labelled data set of (feature*,
+inlining decision) for all the cases in the benchmarks. Build a
+decision tree from this via standard techniques. Decide if any pruning
+is appropriate.
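
A real implementation of step 6 would use standard decision-tree
induction over the full feature set; purely to illustrate the idea,
this sketch learns a single-threshold "stump" rule from an invented
labelled set.

```python
# Invented labelled data: (ILSize, callsiteCount) -> did the winning
# model inline this candidate?
samples = [
    ((8, 1), True), ((12, 3), True), ((16, 1), True),
    ((40, 2), False), ((60, 1), False), ((100, 5), False),
]

def best_stump(data, feature_index):
    """Pick the threshold on one feature that misclassifies fewest
    samples, predicting "inline" below the threshold and "don't"
    at or above it."""
    values = sorted({f[feature_index] for f, _ in data})
    best = (None, len(data) + 1)
    for t in values:
        errors = sum((f[feature_index] < t) != label for f, label in data)
        if errors < best[1]:
            best = (t, errors)
    return best

threshold, errors = best_stump(samples, 0)
```

Recursing on each side of the best split, over every feature, is
essentially how CART-style tree induction would generalize this;
pruning then trades accuracy for a smaller, more stable rule set.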
+
+7. Back-validate the results via measurement to show that (a) the
+final heuristic works as expected and metrics match expectations on
+the training examples, and (b) the results are not over-fitted, by
+cross-validating on untrained benchmarks. Flight this to various
+3rd parties for independent confirmation.
+
+8. Enable the new heuristics by default.
+
+9. Repeat the above process for each code base, scenario,
+architecture, etc.
+
+### AOT Considerations
+
+In the AOT case (particularly the optimized AOT) we will likely need
+to develop a different set of heuristics.
+
+The current frameworks for AOT (NGEN and Ready2Run) give the code
+generator the opportunity to preview the methods to be compiled. We'll
+take advantage of that to build a call graph and orchestrate
+compilation of methods in a "bottom-up" fashion (callees before
+callers). Because of that, when inlining, the caller typically has the
+opportunity to look at a richer set of data about the callee, such as
+a detailed usage summary for its parameters.
+
+The inliner also has the opportunity to plan inlining over a wider
+scope, perhaps adjusting heuristics for the relative importance of
+callers or the shape of the call graph or similar.
+
+## Vetting and Revalidation
+
+With an evaluation framework in place, we can now run carefully
+controlled experiments to see if further enhancements to the inliner
+are warranted. For example, the order in which candidates are
+considered may have an impact -- ideally the inliner would greedily
+look at the best candidates first. Other feature data may prove
+important, for instance actual or synthetic profile data.
+
+The rest of the compiler also changes over time. Periodically we
+should rerun the heuristic development process to see if the current
+set of heuristics is still good, especially if major new capabilities
+are added to the compiler.
+
+We should also continually strive to improve the benchmark set.
+
+## Risks
+
+The plan calls for the inliner's Profitability logic to be constructed
+via machine learning. This exposes us to a number of risks:
+
+* We may not understand how to successfully apply ML techniques.
+* We may run into one of the common ML pitfalls, like having too many
+features, or strongly correlated features.
+* We may not be using the right features.
+* Some of the features may themselves be estimates (e.g., code size).
+* The benchmark set may not be sufficiently representative.
+* We may overfit the benchmark set. The plan here is to cross-validate and
+not use all our benchmarks as training examples.
+* The training process may be too slow to be practical.
+* There may be too much measurement noise.
+* The resulting models may not be stable -- meaning that small changes in
+the input programs might trigger large changes in the inlining decisions.
+* The supervised learning output may still be black-box like in practice.
+* The downstream phases of the compiler might not adapt well to inlining
+changes or may offer an uneven performance surface.
+* We've found it tricky in the past to produce a good combined fitness metric
+from a large set of benchmarks and metrics.
+* It's generally not possible to build a new heuristic which is better in
+every way than the old one -- some metrics on some benchmarks will inevitably
+regress.
+
+However, it's worth pointing out that a manually constructed heuristic faces
+many of these same risks.
+
+## Notes
+
+See this [early
+draft](https://github.com/AndyAyersMS/coreclr/commit/035054402a345f643d9dee0ec31dbdf5fadbb17c),
+now orphaned by a squash, for some interesting comments.