# Proposed Plans for Inlining

This document outlines a plan for improving the inlining capabilities of RyuJit as well as adding inlining capabilities to LLILC.

## Background

Inlining is a key optimization, directly reducing call overhead and indirectly exposing a wider scope for powerful intraprocedural optimizations.

From an implementation standpoint, an inliner has the following aspects:

* Machinery: transformations on the compiler's IR that physically incorporate the code for the callee into the caller.

* Legality: a set of rules that describe when inlining is legal (that is, preserves program semantics). Here, the legality rules are largely dictated to us by the CoreCLR runtime.

* Ability: a set of rules that describe whether the machinery is able to perform a legal inline and whether the result can be handled by downstream phases.

* Profitability: a set of heuristic rules that describe which legal and able inlines are desirable.

The general consensus is that the inlining in RyuJit can be improved, while carefully maintaining RyuJit's current lean compilation overhead. It is anticipated that this work will encompass Machinery, Ability, and Profitability.

LLILC does no inlining today. Since we aspire to have LLILC be a high-performance .Net code generator, we need to enable inlining in LLILC. LLILC can likely leverage much of LLVM's built-in inlining Machinery and Ability, but will need new code for Legality and Profitability.

We envision various scenarios for .Net code generation that impact inlining: a first-tier JIT compiler, a higher-tier JIT compiler, a fast AOT compiler, and an optimizing AOT compiler. Each scenario calls for inlining, but the tradeoffs are different. For a given scenario, we may choose to deploy RyuJit or LLILC or both, depending on the scenario requirements and code generator capabilities.

There are also aspects of inlining that are runtime and target machine specific. Each runtime and target architecture will require some specialized tuning to accommodate the different costs of operations at runtime.

Taking this all into account, the proposed scope of the work encompasses two different compiler code bases, a number of codegen scenarios, and a number of target architectures and runtime conventions. The goal is to come up with one conceptual framework that covers all these cases and avoids relying too much on time-intensive manual investigation and tuning.

We will measure the quality of inlining along three axes: the time spent generating code (aka *throughput* -- TP), the time spent executing the code (aka *code quality* -- CQ), and the size of the generated code (aka *code size* -- CS).

We also desire that the resulting decision-making machinery (in particular the Profitability heuristic) is expressed in terms that are sensible to compiler writers.
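To make the measurement axes concrete, here is a minimal sketch of a per-benchmark metrics record, written in Python for illustration. The type and field names are hypothetical, not part of any existing schema; TP and CQ are shown as instruction counts, anticipating the proxy measurements proposed under "The Metrics" below:

```
# Hypothetical per-benchmark metrics record; names are illustrative only.
from dataclasses import dataclass

@dataclass
class InlineQualityMetrics:
    benchmark: str
    tp_instructions: int  # throughput: instructions retired while compiling
    cq_instructions: int  # code quality: instructions retired while executing
    cs_bytes: int         # code size: bytes of generated code

    def ratios(self, baseline: "InlineQualityMetrics"):
        """Each metric as a ratio vs. a baseline build (1.0 = unchanged)."""
        return (self.tp_instructions / baseline.tp_instructions,
                self.cq_instructions / baseline.cq_instructions,
                self.cs_bytes / baseline.cs_bytes)
```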
## The Proposal

### Machinery

RyuJit's machinery may or may not require updates, depending on how many key inlines are rejected because of lack of ability. This is something that requires further investigation. We may also decide to pursue enhancements here to ensure that in cascaded inline cases the deeper callsites are provided up-to-date information.

LLILC should be able to leverage LLVM's existing inline machinery. In JIT-like scenarios, LLILC's reader will need some updates to properly handle cases where a method is being "read" as an inline candidate rather than as a root method to be compiled. For AOT scenarios we ultimately will want to do something more elaborate, where all of the methods being compiled are represented as LLVM IR.

### Legality

As mentioned above, the legality constraints are generally imposed by the CoreCLR runtime.

These constraints are already captured in the RyuJit source code. However, we generally prefer to at least logically perform all the legality checks before any of the ability or profitability checks, so that we can reason more simply about why a particular inline didn't happen. This may not be entirely practical, since some of the profitability or ability early outs may be helping TP by keeping RyuJit from doing a lot of exploratory work before ultimately rejecting an inline candidate.

LLILC lacks legality checking, and this needs to be implemented.

### Ability

RyuJit currently bails out of some legal inlining cases because of internal restrictions. For instance, if a callee modifies one of its parameters, RyuJit won't inline it because of limitations in how the IR is incorporated. We expect to work on reducing or removing key ability limiters in RyuJit.

Assuming the LLVM inlining framework is robust, LLILC shouldn't have too many ability constraints.

### Profitability

Given the wide range of scenarios, targets, and the two distinct code bases, we propose to rely heavily on machine learning techniques for building and tuning the Profitability components of the inliners.

The general approach we suggest is based on past experiences we've had developing inliners for various compilers and is inspired by the CGO paper [Automatic Construction of Inlining Heuristics using Machine Learning](http://dl.acm.org/citation.cfm?id=2495914) by Kulkarni, Cavazos, Wimmer, and Simon.

Kulkarni et al. treat profitability as an unsupervised learning problem, and create a well-performing heuristic black box (neural network) using evolutionary programming techniques. They then turn around and use this black box as an oracle to label inline instances, and from this guide a supervised machine learning algorithm to produce a decision tree that expresses the profitability heuristics in terms sensible to the compiler writer.

The inputs to the heuristic black box are various facts and estimates about an inline candidate, based on observations of the caller, callee, and call site (collectively, *features*). These features may be booleans, enumerations (categories), integers, or floating-point values. For example:

* (boolean) `CalleeIsLeaf`, `CallSiteInTry`
* (enumeration) `CalleeReturnCorType`
* (int) `CalleeILSize`, `CallSiteProfileWeight`
* (float) `CalleeLoadStoreFrequency`

The output is a yes/no inlining decision for a particular candidate.

The evolutionary process is governed by a fitness metric that expresses the appropriate balance of TP, CQ, and CS for the scenario (see the sketch after the list below). For instance, in a JIT scenario we likely will want to keep TP and CS close to the current baseline values. For AOT we may want to allow higher TP and larger CS if we can demonstrate CQ improvements.

## Implications

For this approach to work, we will need to make frequent and robust measurements of the key fitness metrics across a range of representative benchmarks. So this presumes:

* we can identify and capture these key benchmarks
* we can develop methodology for doing appropriate measurements
* we can do these measurements at appropriate scale
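As one illustration of how such a fitness metric might be expressed for a JIT scenario, the sketch below collapses the three metric ratios (candidate vs. baseline, as in the record sketched earlier) into a single scalar. The function name, weights, and regression thresholds are all invented placeholders, not a settled design:

```
# Hypothetical fitness combining TP/CQ/CS ratios (candidate / baseline).
# Weights and the 2% regression limits are placeholders a JIT scenario
# might use; lower cq_ratio means faster generated code.

def fitness(tp_ratio: float, cq_ratio: float, cs_ratio: float) -> float:
    # Disqualify models that regress throughput or code size too far.
    if tp_ratio > 1.02 or cs_ratio > 1.02:
        return 0.0
    # Otherwise reward code-quality wins, with small penalties for
    # any throughput or code-size growth.
    return (1.0 / cq_ratio) - 0.5 * max(tp_ratio - 1.0, 0.0) \
                            - 0.5 * max(cs_ratio - 1.0, 0.0)

# Example: 5% faster code, 1% slower compile, 1% larger code.
print(fitness(tp_ratio=1.01, cq_ratio=0.95, cs_ratio=1.01))  # ~1.04
```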
### The Benchmarks

Clearly, in a machine learning approach the actual observations made are crucial. We need a solid, representative benchmark set. But this is a general problem with any heuristic-based optimization and not an inherent limitation of this approach -- one can only measure what one can measure.

We plan on capturing the results in a standard schema and hope that, with this and the regularization of the metric data (see next section), interested 3rd parties can perhaps supply their own benchmark results.

### The Metrics

CS is typically very easy to measure and the measurements are "noise free", so the measurement only needs to be done once.

If TP and CQ are both measured in terms of time (or cycles) then the measurements will be inherently noisy. To reduce the variance caused by noise some repeated runs may be necessary, and it might also be necessary to run on dedicated, matched hardware. This severely limits scalability.

We instead propose to measure both TP and CQ in terms of instructions retired. While this may not be entirely noise-free, we expect the measurements to be low noise and fairly insensitive to other computations on the machine. Thus this should give us decent scale (for instance, we can run scenarios in the cloud on VMs with varying underlying hardware, etc.).

For TP, instructions retired should be a reasonable proxy for time, since we expect the CPI for the compiler proper to be relatively static. For CQ, we realize there may not be as strong a correlation with time, but typically the inliner operates early in the compilation process, and inlining is mainly focused on reducing instruction count and less about reducing cycle stalls.

We'll do some experimentation early on to see how well these proxy metrics hold up.

### Analysis

To help prioritize work on profitability and ability (and perhaps find areas where we might discuss trying to remove legality constraints), we need some mechanism to correlate inlining outcomes with success and failure reasons.

We propose to develop a textual, machine-readable inline log schema that describes a set of inlining decisions. This schema can be produced by a compiler as it makes decisions, and contains sufficient correlating data so that we can then match it up with runtime observations. For example, we might end up with the following sort of per-method output (though more likely it will be embedded in some kind of xml or json markup):

```
Inlining for 9811 Lookup
  [o] 22825 Lookup@Lattice
    [o] 22827 ??0?$interior_ptr@PAVCell
  [x] 22826 Lookup@Lattice@@@Z (reason: SizeLimit)
  [o] 21728 ?get_CellList
```

where `[o]` is a successful inline, `[x]` a failed inline, and indentation shows the inlining tree. For .Net compilation we'll need some kind of persistent ID for methods, which may not be all that easy to come by.

This inline log can also be read back by the code generator to enable *inline replay*, forcing the inliner to perform a particular pattern of inlines. This is useful for things like performance investigations, bug isolation, and trying to have one compiler emulate the inlining behavior of another. Since the logs are textual they can also be edited by hand if necessary to perform experiments.
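To make the schema idea more concrete, here is one hypothetical json-flavored shape for the log above, plus a trivial reader of the kind replay might use. The field names and structure are invented for illustration; the real schema is yet to be designed:

```
# Hypothetical json-flavored inline log and a minimal reader for replay.
# Field names and structure are illustrative only; the real schema is TBD.
import json

sample_log = json.loads("""
{
  "root": {"id": 9811, "name": "Lookup", "inlines": [
    {"id": 22825, "name": "Lookup@Lattice", "inlined": true, "inlines": [
      {"id": 22827, "name": "??0?$interior_ptr@PAVCell",
       "inlined": true, "inlines": []}
    ]},
    {"id": 22826, "name": "Lookup@Lattice@@@Z", "inlined": false,
     "reason": "SizeLimit", "inlines": []}
  ]}
}
""")

def replay_decision(node: dict, candidate_id: int):
    """Return the recorded yes/no decision for a candidate, or None."""
    for child in node.get("inlines", []):
        if child["id"] == candidate_id:
            return child["inlined"]
        found = replay_decision(child, candidate_id)
        if found is not None:
            return found
    return None

print(replay_decision(sample_log["root"], 22826))  # False (SizeLimit)
```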
### Obtaining Runtime Data

If we can obtain runtime data about the frequency of call instructions, we can run this through the log to accumulate failure reasons and create a failure reason table like the following (categories and data here are totally made up):

```
FailedReason         | Count    | [%]   | Sites  | [%]
---------------------|----------|-------|--------|------
SpeculativeNonGround | 79347955 | 37.8% | 417    | 14.1%
NonGroundShared      | 28756634 | 13.7% | 217    |  7.4%
SizeLimit            | 22066033 | 10.5% | 760    | 25.8%
ExternFunction       | 21923402 | 10.5% | 300    | 10.2%
... etc ...
```

This can be used to prioritize inlining work; a sketch of the aggregation follows below. For instance, if `SpeculativeNonGround` was an ability issue, and we wanted to improve CQ, we'd give it high priority to fix, since it impacts inlining most frequently.
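A small sketch of how that aggregation might work, assuming a flat list of per-call-site records joined from the inline log and the runtime profile data. The record shape and values are invented for illustration:

```
# Hypothetical aggregation of inline-failure reasons, weighted by the
# runtime call frequency of each failed call site.
from collections import Counter

# (reason, call_count) pairs, as if joined from the inline log and
# runtime profile data; values are invented.
failed_sites = [
    ("SizeLimit", 120000), ("SizeLimit", 45000),
    ("ExternFunction", 300000), ("SpeculativeNonGround", 900000),
]

counts = Counter()
sites = Counter()
for reason, calls in failed_sites:
    counts[reason] += calls  # dynamic weight: how often the site runs
    sites[reason] += 1       # static weight: how many sites fail this way

total = sum(counts.values())
for reason, calls in counts.most_common():
    print(f"{reason:22s} {calls:9d} {100.0 * calls / total:5.1f}% "
          f"{sites[reason]:5d}")
```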
### Robustness

Since we'll likely be altering inlining decisions, we may uncover bugs or performance blemishes in the compilers.

It may make sense to build up a random inlining methodology (RyuJit evidently has something like this) to try and stress test the compiler in the face of inlining changes. Log replay can be used with automatic bisection tools to try and isolate the particular inline that exposes a bug.

Performance issues are harder to spot automatically, and will likely require some amount of manual analysis. However, we may be able to study inline instances that are not well-predicted by the inlining models -- for instance, cases where the CS is much greater than predicted, or CQ is lower -- and see if the modelling shortfall is an indicator of problems within the compiler itself.

## Development Stages

Given the above, here's a rough draft outline for how the functionality could be built.

### Preliminaries -- Measurement

1. Make some benchmarks available for measurement.

2. Demonstrate we can reliably measure the metrics of interest: CS, CQ, and TP (as bytes, instructions, and instructions, respectively) when jitting and pre-jitting the benchmarks. Get some idea of the noise levels for CQ and TP. Hopefully they are small.

### Preliminaries -- RyuJit

1. Refactor RyuJit code to clearly separate out the legality, ability, and profitability aspects of the inliner. Hopefully this can be done with no CQ changes and no TP impact.

2. Develop the inline log format.

3. Produce inline logs from RyuJit.

4. Develop methodology for runtime correlation.

5. Produce failure reason analysis for the benchmark set. Decide which, if any, abilities need improving.

6. Implement inline log replay for RyuJit.

7. Modify logs by hand or via a tool and replay them to assess the landscape of TP/CQ/CS tradeoffs from inliner changes.

### Preliminaries -- LLILC

1. Add code to capture legality constraints when reading in a method.

2. Generalize LLILC's MSIL reader to read in an inlinee in the context of the caller.

3. Write a simple LLVM pass (or specialize an existing pass) to perform inlining. Initially just implement a very simple profitability heuristic. Verify it all works correctly.

4. Implement inline log generation by LLILC.

5. Implement inline log replay by LLILC.

6. Modify logs by hand or via a tool and replay them to assess the landscape of TP/CQ/CS tradeoffs from inliner changes.

### Develop the Heuristics (JIT case)

1. Start by trying to produce code size estimates for a given inline candidate. Measure actual code size impacts for individual inlining decisions.

2. Determine features to surface and produce a labelled data set of (feature*, size impact). Use this to build a machine-learning model estimator for code size.

3. We probably can't just plug the new size estimate into the existing heuristic framework and expect good results (generally, better information in does not always result in better decisions out). So instead build up a parallel (off by default) heuristic path where we use the new size estimator.

4. Surface features for the CQ and TP impacts. Build a pluggable model for the black box heuristic in the compiler. Likely this means that the model code is written by a tool from some textual description.

5. Run evolutionary experiments to produce an optimized black box heuristic. This involves: (a) randomly or otherwise generating an initial population of models; (b) running each model on the benchmarks, gathering metrics; (c) combining metrics across benchmarks to produce fitness; (d) using fitness to create a new set of models using various genetic approaches; (e) iterating until satisfied with the result. All of this setup should be automated and available in the CI system.

6. Use the winning model to produce a labelled data set of (feature*, inlining decision) for all the cases in the benchmarks. Build a decision tree from this via standard techniques (see the sketch after this list). Decide if any pruning is appropriate.

7. Back-validate the results via measurement to show (a) the final heuristic works as expected and metrics match expectations on the training examples, and (b) the results are not over-fitted, by cross-validating on untrained benchmarks. Flight this to various 3rd parties for independent confirmation.

8. Enable the new heuristics as defaults.

9. Repeat the above process for each code base, scenario, architecture, etc.
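For step 6, a minimal sketch of the supervised step, using scikit-learn as one possible tool (an assumption on tooling; the feature names and data points are invented):

```
# Hypothetical sketch of step 6: fit a small decision tree to
# (features, inline decision) pairs labelled by the black-box oracle.
# scikit-learn is one tooling option; features and data are invented.
from sklearn import tree

# Feature order: CalleeILSize, CalleeIsLeaf, CallSiteProfileWeight
X = [[12, 1, 900], [340, 0, 15], [25, 1, 50], [500, 0, 2000]]
y = [1, 0, 1, 0]  # 1 = inline, 0 = don't, as labelled by the oracle

clf = tree.DecisionTreeClassifier(max_depth=3).fit(X, y)

# The learned rules can be rendered in a compiler-writer-readable form.
print(tree.export_text(
    clf,
    feature_names=["CalleeILSize", "CalleeIsLeaf",
                   "CallSiteProfileWeight"]))
```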
### AOT Considerations

In the AOT case (particularly the optimized AOT) we will likely need to develop a different set of heuristics.

The current frameworks for AOT (NGEN and Ready2Run) give the code generator the opportunity to preview the methods to be compiled. We'll take advantage of that to build a call graph and orchestrate compilation of methods in a "bottom-up" fashion (callees before callers). Because of that, when inlining, the caller typically has the opportunity to look at a richer set of data about the callee -- for instance, a detailed usage summary for parameters and similar.

The inliner also has the opportunity to plan inlining over a wider scope, perhaps adjusting heuristics for the relative importance of callers or the shape of the call graph or similar.

## Vetting and Revalidation

With an evaluation framework in place, we can now run carefully controlled experiments to see if further enhancements to the inliner are warranted. For example, the order in which candidates are considered may have an impact -- ideally the inliner would greedily look at the best candidates first. Other feature data may prove important, for instance actual or synthetic profile data.

The rest of the compiler also changes over time. Periodically we should rerun the heuristic development to see if the current set of heuristics is still good, especially if major new capabilities are added to the compiler.

We should also continually strive to improve the benchmark set.

## Risks

The plan calls for the inliner's Profitability logic to be constructed via machine learning. This exposes us to a number of risks:

* We may not understand how to successfully apply ML techniques.
* We may run into one of the common ML pitfalls, like having too many features, or strongly correlated features.
* We may not be using the right features.
* Some of the features may themselves be estimates (e.g. code size).
* The benchmark set may not be sufficiently representative.
* We may overfit the benchmark set. The plan here is to cross-validate and not use all our benchmarks as training examples.
* The training process may be too slow to be practical.
* There may be too much measurement noise.
* The resulting models may not be stable -- meaning that small changes in the input programs might trigger large changes in the inlining decisions.
* The supervised learning output may still be black-box-like in practice.
* The downstream phases of the compiler might not adapt well to inlining changes or may offer an uneven performance surface.
* We've found it tricky in the past to produce a good combined fitness metric from a large set of benchmarks and metrics.
* It's generally not possible to build a new heuristic which is better in every way than the old one -- some metrics on some benchmarks will inevitably regress.

However, it's worth pointing out that a manually constructed heuristic faces many of these same risks.

## Notes

See this [early draft](https://github.com/AndyAyersMS/coreclr/commit/035054402a345f643d9dee0ec31dbdf5fadbb17c), now orphaned by a squash, for some interesting comments.