author    Joseph Tremoulet <jotrem@microsoft.com>   2017-07-20 13:28:53 -0400
committer Joseph Tremoulet <JCTremoulet@gmail.com>  2017-07-31 15:52:35 -0400
commit    6b38dca32dee8321dafab8be92366d17da2b8bec (patch)
tree      57e9afbf7d90e272a6211b4ad3f039fc0aeed22d /Documentation
parent    f17fae2a1aa1bcc312cf15d6857d30cfef00c2d0 (diff)
Add documents about JIT optimization planning
This change adds two documents:

- JitOptimizerPlanningGuide.md discusses how we can/do/should go about
  identifying, prioritizing, and validating optimization improvement
  opportunities, as well as several ideas for how we might improve the process.
- JitOptimizerTodoAssessment.md lists several potential optimization
  improvements that always come up in planning discussions, with brief notes
  about each, to capture current thinking.
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/performance/JitOptimizerPlanningGuide.md  | 127
-rw-r--r-- Documentation/performance/JitOptimizerTodoAssessment.md | 134
2 files changed, 261 insertions, 0 deletions
diff --git a/Documentation/performance/JitOptimizerPlanningGuide.md b/Documentation/performance/JitOptimizerPlanningGuide.md
new file mode 100644
index 0000000000..6f65f146e0
--- /dev/null
+++ b/Documentation/performance/JitOptimizerPlanningGuide.md
@@ -0,0 +1,127 @@
+JIT Optimizer Planning Guide
+============================
+
+The goal of this document is to capture some thinking about the process used to
+prioritize and validate optimizer investments. The overriding goal of such
+investments is to help ensure that the dotnet platform satisfies developers'
+performance needs.
+
+
+Benchmarking
+------------
+
+There are a number of public benchmarks which evaluate different platforms'
+relative performance, so naturally dotnet's scores on such benchmarks give
+some indication of how well it satisfies developers' performance needs. The JIT
+team has used some of these benchmarks, particularly [TechEmpower](https://www.techempower.com/benchmarks/)
+and [Benchmarks Game](http://benchmarksgame.alioth.debian.org/), for scouting
+out optimization opportunities and prioritizing optimization improvements.
+While it is important to track scores on such benchmarks to validate performance
+changes in the dotnet platform as a whole, when it comes to planning and
+prioritizing JIT optimization improvements specifically, they aren't sufficient,
+due to a few well-known issues:
+
+ - For macro-benchmarks, such as TechEmpower, compiler optimization is often not
+ the dominant factor in performance. The effects of individual optimizer
+ changes are most often in the sub-percent range, well below the noise level
+ of the measurements, which will usually be at least 3% or so even for the
+ most well-behaved macro-benchmarks.
+ - Source-level changes can be made much more rapidly than compiler optimization
+ changes. This means that for anything we're trying to track where the whole
+ team is effecting changes in source, runtime, etc., any particular code
+ sequence we may target with optimization improvements may well be targeted
+ with source changes in the interim, nullifying the measured benefit of the
+ optimization change when it is eventually merged. Source/library/runtime
+   changes are in play for both TechEmpower and Benchmarks Game.
+
+Compiler micro-benchmarks (like those in our [test tree](https://github.com/dotnet/coreclr/tree/master/tests/src/JIT/Performance/CodeQuality))
+don't share these issues, and adding them as optimizations are implemented is
+critical for validation and regression prevention; however, micro-benchmarks
+often aren't as representative of real-world code, and therefore not as
+reflective of developers' performance needs, so they aren't well suited for
+scouting
+out and prioritizing opportunities.
+
+
+Benefits of JIT Optimization
+----------------------------
+
+While source changes can more rapidly and dramatically effect changes to
+targeted hot code sequences in macro-benchmarks, compiler changes have the
+advantage that they apply broadly to all compiled code. One of the best reasons
+to invest in compiler optimization improvements is to capitalize on this. A few
+specific benefits:
+
+ - Optimizer changes can effect "peanut-butter" improvements; by making an
+ improvement which is small in any particular instance to a code sequence that
+ is repeated thousands of times across a codebase, they can produce substantial
+ cumulative wins. These should accrue toward the standard metrics (benchmark
+ scores and code size), but identifying the most profitable "peanut-butter"
+ opportunities is difficult. Improving our methodology for identifying such
+ opportunities would be helpful; some ideas are below.
+ - Optimizer changes can unblock coding patterns that performance-sensitive
+ developers want to employ but consider prohibitively expensive. They may
+   have inelegant workarounds in their code, such as gotos for loop-exiting
+ returns to work around poor block layout, manually scalarized structs to work
+ around poor struct promotion, manually unrolled loops to work around lack of
+ loop unrolling, limited use of lambdas to work around inefficient access to
+   heap-allocated closures, etc. (a sketch of the goto workaround follows
+   this list).  The more the optimizer can improve such
+ situations, the better, as it both increases developer productivity and
+ increases the usefulness of abstractions provided by the language and
+ libraries. Finding a measurable metric to track this type of improvement
+ poses a challenge, but would be a big help toward prioritizing and validating
+ optimization improvements; again, some ideas are below.
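+
+As a concrete sketch of the first workaround above (the method name and shape
+are invented for illustration, not taken from real code):
+
+```csharp
+// Hypothetical illustration only: a loop-exiting 'return i;' at the match
+// point can produce worse block layout, so performance-sensitive code
+// sometimes routes the exit through a goto instead.
+static class Search
+{
+    static int IndexOf(int[] values, int target)
+    {
+        int i = 0;
+        for (; i < values.Length; i++)
+        {
+            if (values[i] == target)
+                goto Found;   // workaround for layout of loop-exiting returns
+        }
+        return -1;
+    Found:
+        return i;
+    }
+}
+```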
+
+
+Brainstorm
+----------
+
+Listed here are several ideas for undertakings we might pursue to improve our
+ability to identify opportunities and validate/track improvements that mesh
+with the benefits discussed above. Thinking here is in the early stages, but
+the hope is that with some thought/discussion some of these will surface as
+worth investing in.
+
+ - Is there telemetry we can implement/analyze to identify "peanut-butter"
+   opportunities, or target particular "coding patterns"?  Probably easier to use this
+ to evaluate/prioritize patterns we're considering targeting than to identify
+ the patterns in the first place.
+ - Can we construct some sort of "peanut-butter profiler"? The idea would
+ roughly be to aggregate samples/counters under particular input constructs
+   rather than under call stacks.  Might it be interesting to
+ group by MSIL opcode, or opcode pair, or opcode triplet... ?
+ - It might behoove us to build up some SPMI traces that could be data-mined
+ for any of these experiments.
+ - We should make it easy to view machine code emitted by the jit, and to
+ collect profiles and correlate them with that machine code. This could
+ benefit any developers doing performance analysis of their own code.
+   The JIT team has discussed this; options include building something on top of
+ the profiler APIs, enabling COMPlus_JitDisasm in release builds, and shipping
+ with or making easily available an alt jit that supports JitDisasm.
+ - Hardware companies maintain optimization/performance guides for their ISAs.
+ Should we maintain one for MSIL and/or C# (and/or F#)? If we hosted such a
+ thing somewhere publicly votable, we could track which anti-patterns people
+   find most frustrating to avoid, and track their subsequent removal.  Does such
+ a guide already exist somewhere, that we could use as a starting point?
+ Should we collate GitHub issues or Stack Overflow issues to create such a thing?
+ - Maybe we should expand our labels on GitHub so that there are sub-areas
+ within "optimization"? It could help prioritize by letting us compare the
+ relative sizes of those buckets.
+ - Can we more effectively leverage the legacy JIT codebases for comparative
+ analysis? We've compared micro-benchmark performance against Jit64 and
+   manually compared disassembly of hot code; what else can we do?  One concrete
+ idea: run over some large corpus of code (SPMI?), and do a path-length
+ comparison e.g. by looking at each sequence of k MSIL instructions (for some
+ small k), and for each combination of k opcodes collect statistics on the
+ size of generated machine code (maybe using debug line number info to do the
+ correlation?), then look for common sequences which are much longer with
+   RyuJIT (a sketch of this aggregation follows this list).
+ - Maybe hook RyuJIT up to some sort of superoptimizer to identify opportunities?
+ - Microsoft Research has done some experimenting that involved converting RyuJIT
+ IR to LLVM IR; perhaps we could use this to identify common expressions that
+ could be much better optimized.
+ - What's a practical way to establish a metric of "unblocked coding patterns"?
+ - How developers give feedback about patterns/performance could use some thought;
+ the GitHub issue list is open, but does it need to be publicized somehow? We
+ perhaps should have some regular process where we pull issues over from other
+ places where people report/discuss dotnet performance issues, like
+ [Stack Overflow](https://stackoverflow.com/questions/tagged/performance+.net).
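+
+As a rough sketch of the path-length comparison idea above (the trace format,
+type names, and `JitRecord` shape are all invented for illustration, not an
+existing tool):
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical: one record per MSIL instruction, paired with the number of
+// machine-code bytes the JIT emitted for it.
+record JitRecord(string Opcode, int MachineCodeBytes);
+
+static class PathLengthStats
+{
+    const int K = 3; // length of opcode sequences to compare
+
+    // Aggregate (occurrence count, total machine-code bytes) per K-opcode
+    // sequence across a corpus of methods.
+    public static Dictionary<string, (long Count, long Bytes)> Aggregate(
+        IEnumerable<IReadOnlyList<JitRecord>> methods)
+    {
+        var stats = new Dictionary<string, (long Count, long Bytes)>();
+        foreach (var method in methods)
+        {
+            for (int i = 0; i + K <= method.Count; i++)
+            {
+                var window = method.Skip(i).Take(K).ToArray();
+                string key = string.Join("/", window.Select(r => r.Opcode));
+                long bytes = window.Sum(r => r.MachineCodeBytes);
+                stats.TryGetValue(key, out var cur);
+                stats[key] = (cur.Count + 1, cur.Bytes + bytes);
+            }
+        }
+        return stats;
+    }
+}
+```
+
+Running this over traces from RyuJIT and a legacy JIT and diffing the
+per-sequence averages would surface the sequences that grew the most.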
diff --git a/Documentation/performance/JitOptimizerTodoAssessment.md b/Documentation/performance/JitOptimizerTodoAssessment.md
new file mode 100644
index 0000000000..7d53dab5f5
--- /dev/null
+++ b/Documentation/performance/JitOptimizerTodoAssessment.md
@@ -0,0 +1,134 @@
+Optimizer Codebase Status/Investments
+=====================================
+
+There are a number of areas in the optimizer that we know we would invest in
+improving if resources were unlimited. This document lists them and some
+thoughts about their current state and prioritization, in an effort to capture
+the thinking about them that comes up in planning discussions.
+
+
+Improved Struct Handling
+------------------------
+
+This is an area that has received recent attention, with the [first-class structs](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md)
+work and the struct promotion improvements that went in for `Span<T>`. Work here
+is expected to continue and can happen incrementally. Possible next steps:
+
+ - Struct promotion stress mode (test mode to improve robustness/reliability)
+ - Promotion of more structs; relax limits on e.g. field count (should generally
+ help performance-sensitive code where structs are increasingly used to avoid
+   heap allocations; the field-count limit is illustrated after this list)
+ - Improve handling of System V struct passing (I think we currently insert
+ some unnecessary round-trips through memory at call boundaries due to
+ internal representation issues)
+ - Implicit byref parameter promotion w/o shadow copy
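+
+As a hypothetical illustration of the field-count limit (the exact threshold
+is a JIT implementation detail and may change):
+
+```csharp
+// A small struct is a good promotion candidate: its fields can live
+// entirely in registers, with no stack home needed.
+struct Small
+{
+    public int A, B;
+}
+
+// A struct with more fields may exceed the promotion limit, forcing its
+// fields through memory even in hot code.
+struct Large
+{
+    public int A, B, C, D, E, F;
+}
+```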
+
+We don't have specific benchmarks that we know would jump in response to any of
+these.  We may well be able to find some by looking, though this may be an
+area where current performance-sensitive code avoids structs.
+
+
+Exception handling
+------------------
+
+This is increasingly important as C# language constructs like async/await and
+certain `foreach` incantations are implemented with EH constructs, making them
+difficult to avoid at source level. The recent work on finally cloning, empty
+finally removal, and empty try removal targeted this. [Writethrough](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/eh-writethru.md)
+is another key optimization enabler here, and we are actively pursuing it. Other
+things we've discussed include inlining methods with EH and computing funclet
+callee-save register usage independently of main function callee-save register
+usage, but I don't think we have any particular data pointing to either as a
+high priority.
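+
+As a reminder of why this is hard to avoid at source level, a plain `foreach`
+over an `IEnumerable<T>` expands to roughly the following (a sketch; the C#
+compiler's exact expansion differs in details):
+
+```csharp
+using System.Collections.Generic;
+
+static class Example
+{
+    static int Sum(IEnumerable<int> values)
+    {
+        int sum = 0;
+        IEnumerator<int> e = values.GetEnumerator();
+        try
+        {
+            while (e.MoveNext())
+                sum += e.Current;
+        }
+        finally
+        {
+            e.Dispose();   // EH construct the source author never wrote
+        }
+        return sum;
+    }
+}
+```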
+
+
+Loop Optimizations
+------------------
+
+We haven't been targeting benchmarks that spend a lot of time doing computations
+in an inner loop. Pursuing loop optimizations for the peanut butter effect
+would seem odd. So this simply hasn't bubbled up in priority yet, though it's
+bound to eventually.
+
+
+More Expression Optimizations
+-----------------------------
+
+We again don't have particular benchmarks pointing to key missing cases, and
+balancing code quality (CQ) against throughput (TP) will be delicate here, so
+it would really help to have an appropriate benchmark suite to evaluate this
+work against.
+
+
+Forward Substitution
+--------------------
+
+This too needs an appropriate benchmark suite that I don't think we have at
+this time. The tradeoffs against register pressure increase and throughput
+need to be evaluated. This also might make more sense to do if/when we can
+handle SSA renames.
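+
+A minimal illustration of the transform, in source terms (the JIT would
+perform this on its IR, not on C#):
+
+```csharp
+static class FwdSubExample
+{
+    static int WithTemp(int a, int b, int c)
+    {
+        int t = a + b;   // separate definition...
+        return t * c;    // ...with a single use
+    }
+
+    static int Substituted(int a, int b, int c)
+    {
+        // After forward substitution the definition is folded into its use,
+        // giving the backend one expression tree to optimize, at the cost of
+        // longer live ranges (the register-pressure tradeoff noted above).
+        return (a + b) * c;
+    }
+}
+```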
+
+
+Value Number Conservatism
+---------------------------
+
+We have some frustrating phase-ordering issues resulting from this, but the
+opt-repeat experiment indicated that they're not prevalent enough to merit
+changing this right now.  Also, using SSA def as the proxy for value
+number would require handling SSA renaming, so there's a big dependency chained
+to this.
+Maybe it's worth reconsidering the priority based on throughput?
+
+
+High Tier Optimizations
+-----------------------
+
+We don't have that many knobs we can "crank up" (though we do have the tracked
+assertion count and could switch inliner policies), nor do we have any sort of
+benchmarking story set up to validate whether tiering changes are helping or
+hurting. We should get that benchmarking story sorted out and at least hook
+up those two knobs.
+
+
+Low Tier Back-Off
+-----------------
+
+We have some changes we know we want to make here: morph does more than it needs
+to in minopts, and tier 0 should be doing throughput-improving inlines, as
+opposed to minopts which does no inlining. It would be nice to have the
+benchmarking story set up to measure the effect of such changes when they go
+in; we should do that.
+
+
+Async
+-----
+
+We've made note of the prevalence of async/await in modern code (and particularly
+in web server code such as TechEmpower), and have some opportunities listed in
+[#7914](https://github.com/dotnet/coreclr/issues/7914). Some sort of study of
+async peanut butter to find more opportunities is probably in order, but what
+would that look like?
+
+
+Address Mode Building
+---------------------
+
+One opportunity that's frequently visible in asm dumps is that more address
+expressions could be folded into memory operands' address expressions. This
+would likely give a measurable code-size win.  Needs some thought about where
+to run in phase list and how aggressive to be about e.g. analyzing across
+statements.
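+
+An illustrative example (the asm comment shows the x64 addressing mode we'd
+like to see, not actual current JIT output):
+
+```csharp
+// x64 memory operands support [base + index*scale + disp], so the address
+// arithmetic below can fold into a single operand, e.g.
+//   mov eax, dword ptr [rax + rdx*4 + 0x18]
+// rather than being computed with separate lea/add instructions.
+static class AddrMode
+{
+    static int LoadElement(int[] a, int i)
+    {
+        return a[i + 2];
+    }
+}
+```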
+
+
+If-Conversion (cmov formation)
+------------------------------
+
+This hits big in the microbenchmarks where it applies.  There's some work in flight
+on this (see #7447 and #10861).
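+
+The canonical candidate shape (whether this actually becomes a cmov depends
+on the in-flight work above):
+
+```csharp
+// A conditional move (cmp + cmov) can replace the compare-and-branch here,
+// eliminating a potentially unpredictable branch.
+static class IfConvert
+{
+    static int Max(int x, int y)
+    {
+        return x > y ? x : y;
+    }
+}
+```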
+
+
+Mulshift
+--------
+
+Replacing multiplication by constants with shift/add/lea sequences is a
+classic optimization that keeps coming up in planning. An [analysis](https://gist.github.com/JosephTremoulet/c1246b17ea2803e93e203b9969ee5a25#file-mulshift-md)
+indicates that RyuJIT is already capitalizing on most of the opportunity here.
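+
+For reference, the classic transform looks like this (the asm is illustrative
+x64, not actual current JIT output):
+
+```csharp
+// Strength reduction of multiply-by-constant:
+//   x * 5  ==>  lea eax, [rcx + rcx*4]               // x + x*4
+//   x * 10 ==>  lea eax, [rcx + rcx*4]; add eax, eax
+static class Mulshift
+{
+    static int Times5(int x) => x * 5;
+    static int Times10(int x) => x * 10;
+}
+```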