author    Jiyoung Yun <jy910.yun@samsung.com>  2017-02-10 11:35:12 (GMT)
committer Jiyoung Yun <jy910.yun@samsung.com>  2017-02-10 11:35:12 (GMT)
commit    4b11dc566a5bbfa1378d6266525c281b028abcc8 (patch)
tree      b48831a898906734f8884d08b6e18f1144ee2b82 /Documentation
parent    db20f3f1bb8595633a7e16c8900fd401a453a6b5 (diff)
Imported Upstream version 1.0.0.9910 (tag: upstream/1.0.0.9910)
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/botr/clr-abi.md16
-rw-r--r--Documentation/building/android.md102
-rw-r--r--Documentation/building/cross-building.md25
-rw-r--r--Documentation/building/testing-with-corefx.md25
-rw-r--r--Documentation/building/unix-test-instructions.md3
-rw-r--r--Documentation/design-docs/finally-optimizations.md487
-rw-r--r--Documentation/design-docs/tailcalls-with-helpers.md460
-rw-r--r--Documentation/project-docs/contributing-workflow.md2
-rw-r--r--Documentation/project-docs/glossary.md7
-rw-r--r--Documentation/workflow/IssuesFeedbackEngagement.md12
-rw-r--r--Documentation/workflow/OfficalAndDailyBuilds.md2
11 files changed, 1110 insertions, 31 deletions
diff --git a/Documentation/botr/clr-abi.md b/Documentation/botr/clr-abi.md
index caa5c7a..6719522 100644
--- a/Documentation/botr/clr-abi.md
+++ b/Documentation/botr/clr-abi.md
@@ -50,7 +50,7 @@ Managed varargs are not supported in .NET Core.
## Generics
-*Shared generics*. In cases where the code address does not uniquely identify a generic instantiation of a method, then a 'generic instantiation parameter' is required. Often the "this" pointer can serve dual-purpose as the instantiation parameter. When the "this" pointer is not the generic parameter, the generic parameter is passed as the next argument (after the optional return buffer and the optional "this" pointer, but before any user arguments). For generic methods (where there is a type parameter directly on the method, as compared to the type), the generic parameter currently is a MethodDesc pointer (I believe an InstantiatedMethodDesc). For static methods (where there is no "this" pointer) the generic parameter is a MethodTable pointer/TypeHandle.
+*Shared generics*. In cases where the code address does not uniquely identify a generic instantiation of a method, then a 'generic instantiation parameter' is required. Often the "this" pointer can serve dual-purpose as the instantiation parameter. When the "this" pointer is not the generic parameter, the generic parameter is passed as an additional argument. On ARM, ARM64 and AMD64, it is passed after the optional return buffer and the optional "this" pointer, but before any user arguments. On x86, if all arguments of the function including "this" pointer fit into argument registers (ECX and EDX) and we still have argument registers available, we store the hidden argument in the next available argument register. Otherwise it is passed as the last stack argument. For generic methods (where there is a type parameter directly on the method, as compared to the type), the generic parameter currently is a MethodDesc pointer (I believe an InstantiatedMethodDesc). For static methods (where there is no "this" pointer) the generic parameter is a MethodTable pointer/TypeHandle.
Sometimes the VM asks the JIT to report and keep alive the generics parameter. In this case, it must be saved on the stack someplace and kept alive via normal GC reporting (if it was the "this" pointer, as compared to a MethodDesc or MethodTable) for the entire method except the prolog and epilog. Also note that the code to home it, must be in the range of code reported as the prolog in the GC info (which probably isn't the same as the range of code reported as the prolog in the unwind info).
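
As a loose analogy (illustrative only, with invented names; not the actual ABI), the role of the hidden instantiation parameter can be sketched in Python: shared code that cannot recover the instantiation from the "this" pointer receives it as an explicit extra argument.

```python
# Loose analogy: one shared body serves every "instantiation" by taking
# an explicit instantiation parameter. In the real ABI this would be a
# MethodDesc or MethodTable pointer; a Python type object stands in here.
def create_default(type_handle):
    # A "static generic method" has no `this` pointer to recover the
    # instantiation from, so the type handle is passed explicitly.
    return type_handle()

empty_list = create_default(list)   # shared code, instantiated for list
empty_dict = create_default(dict)   # same code, instantiated for dict
```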
@@ -139,17 +139,19 @@ This section describes the conventions the JIT needs to follow when generating c
## Funclets
-For non-x86 platforms, all managed EH handlers (finally, fault, filter, filter-handler, and catch) are extracted into their own 'funclets'. To the OS they are treated just like first class functions (separate PDATA and XDATA (`RUNTIME_FUNCTION` entry), etc.). The CLR currently treats them just like part of the parent function in many ways. The main function and all funclets must be allocated in a single code allocation (see hot cold splitting). They 'share' GC info. Only the main function prolog can be hot patched.
+For all platforms except Windows/x86, all managed EH handlers (finally, fault, filter, filter-handler, and catch) are extracted into their own 'funclets'. To the OS they are treated just like first class functions (separate PDATA and XDATA (`RUNTIME_FUNCTION` entry), etc.). The CLR currently treats them just like part of the parent function in many ways. The main function and all funclets must be allocated in a single code allocation (see hot cold splitting). They 'share' GC info. Only the main function prolog can be hot patched.
The only way to enter a handler funclet is via a call. In the case of an exception, the call is from the VM's EH subsystem as part of exception dispatch/unwind. In the non-exceptional case, this is called local unwind or a non-local exit. In C# this is accomplished by simply falling-through/out of a try body or an explicit goto. In IL this is always accomplished via a LEAVE opcode, within a try body, targeting an IL offset outside the try body. In such cases the call is from the JITed code of the parent function.
-For x86, all handlers are generated within the method body, typically in lexical order. A nested try/catch is generated completely within the EH region in which it is nested. These handlers are essentially "in-line funclets", but they do not look like normal functions: they do not have a normal prolog or epilog, although they do have special entry/exit and register conventions. Also, nested handlers are not un-nested as for funclets: the code for a nested handler is generated within the handler in which it is nested.
+For Windows/x86, all handlers are generated within the method body, typically in lexical order. A nested try/catch is generated completely within the EH region in which it is nested. These handlers are essentially "in-line funclets", but they do not look like normal functions: they do not have a normal prolog or epilog, although they do have special entry/exit and register conventions. Also, nested handlers are not un-nested as for funclets: the code for a nested handler is generated within the handler in which it is nested.
## Cloned finallys
JIT64 attempts to speed the normal control flow by 'inlining' a called finally along the 'normal' control flow (i.e., leaving a try body in a non-exceptional manner via C# fall-through). Because the VM semantics for non-rude Thread.Abort dictate that handlers will not be aborted, the JIT must mark these 'inlined' finally bodies. These show up as special entries at the end of the EH tables and are marked with `COR_ILEXCEPTION_CLAUSE_FINALLY | COR_ILEXCEPTION_CLAUSE_DUPLICATED`, and the try_start, try_end, and handler_start are all the same: the start of the cloned finally.
-JIT32 and RyuJIT currently do not implement finally cloning.
+RyuJIT also implements finally cloning, for all supported architectures. However, the implementation does not yet handle the thread abort case; cloned finally bodies are not guaranteed to remain intact and are not reported to the runtime. Because of this, finally cloning is disabled for VMs that support thread abort (desktop CLR).
+
+JIT32 does not implement finally cloning.
## Invoking Finallys/Non-local exits
@@ -283,7 +285,7 @@ The PSPSym is a pointer-sized local variable in the frame of the main function a
The VM uses the PSPSym to find other locals it cares about (such as the generics context in a funclet frame). The JIT uses it to re-establish the frame pointer register, so that the frame pointer is the same value in a funclet as it is in the main function body.
-When a funclet is called, it is passed the *Establisher Frame Pointer*. For AMD64 this is true for all funclets and it is passed as the first argument in RCX, but for ARM and ARM64 this is only true for first pass funclets (currently just filters) and it is passed as the second argument in R1. The Establisher Frame Pointer is a stack pointer of an interesting "parent" frame in the exception processing system. For the CLR, it points either to the main function frame or a dynamically enclosing funclet frame from the same function, for the funclet being invoked. The value of the Establisher Frame Pointer is Initial-SP on AMD64, Caller-SP on ARM and ARM64.
+When a funclet is called, it is passed the *Establisher Frame Pointer*. For AMD64 this is true for all funclets and it is passed as the first argument in RCX, but for ARM and ARM64 this is only true for first pass funclets (currently just filters) and it is passed as the second argument in R1. The Establisher Frame Pointer is a stack pointer of an interesting "parent" frame in the exception processing system. For the CLR, it points either to the main function frame or a dynamically enclosing funclet frame from the same function, for the funclet being invoked. The value of the Establisher Frame Pointer is Initial-SP on AMD64, Caller-SP on x86, ARM, and ARM64.
Using the establisher frame, the funclet wants to load the value of the PSPSym. Since we don't know if the Establisher Frame is from the main function or a funclet, we design the main function and funclet frame layouts to place the PSPSym at an identical, small, constant offset from the Establisher Frame in each case. (This is also required because we only report a single offset to the PSPSym in the GC information, and that offset must be valid for the main function and all of its funclets). Then, the funclet uses this known offset to compute the PSPSym address and read its value. From this, it can compute the value of the frame pointer (which is a constant offset from the PSPSym value) and set the frame register to be the same as the parent function. Also, the funclet writes the value of the PSPSym to its own frame's PSPSym. This "copying" of the PSPSym happens for every funclet invocation, in particular, for every nested funclet invocation.
@@ -331,9 +333,9 @@ When a funclet finishes execution, and the VM returns execution to the function
Any register value changes made in the funclet are lost. If a funclet wants to make a variable change known to the main function (or the funclet that contains the "try" region), that variable change needs to be made to the shared main function stack frame.
-## x86 EH considerations
+## Windows/x86 EH considerations
-The x86 model is somewhat different than the non-x86 model. X86-specific concerns are mentioned here.
+The Windows/x86 model is somewhat different from the non-Windows/x86 model. Windows/x86-specific concerns are mentioned here.
### catch / filter-handler regions
diff --git a/Documentation/building/android.md b/Documentation/building/android.md
new file mode 100644
index 0000000..cfb509d
--- /dev/null
+++ b/Documentation/building/android.md
@@ -0,0 +1,102 @@
+Cross Compilation for Android on Linux
+======================================
+
+Through cross compilation, it is possible to build CoreCLR on Linux for arm64 Android.
+
+Requirements
+------------
+
+You'll need to generate a toolchain and a sysroot for Android. There's a script which takes care of the required steps.
+
+Generating the rootfs
+---------------------
+
+To generate the rootfs, run the following command in the `coreclr` folder:
+
+```
+cross/init-android-rootfs.sh
+```
+
+This will download the NDK and any packages required to compile Android on your system. It's over 1 GB of data, so it may take a while.
+
+
+Cross compiling CoreCLR
+-----------------------
+Once the rootfs has been generated, it will be possible to cross compile CoreCLR.
+
+When cross compiling, you need to set both the `CONFIG_DIR` and `ROOTFS_DIR` variables.
+
+To compile for arm64, run:
+
+```
+CONFIG_DIR=`realpath cross/android/arm64` ROOTFS_DIR=`realpath cross/android-rootfs/toolchain/arm64/sysroot` ./build.sh cross arm64 skipgenerateversion skipmscorlib cmakeargs -DENABLE_LLDBPLUGIN=0
+```
+
+The resulting binaries will be found in `bin/Product/Linux.<BuildArch>.<BuildType>/`
+
+Running the PAL tests on Android
+--------------------------------
+
+You can run the PAL tests on an Android device. To run the tests, you first copy the PAL tests to your Android phone using
+`adb`, and then run them in an interactive Android shell using `adb shell`:
+
+To copy the PAL tests over to an Android phone:
+```
+adb push bin/obj/Linux.arm64.Debug/src/pal/tests/palsuite/ /data/local/tmp/coreclr/src/pal/tests/palsuite
+adb push cross/android/toolchain/arm64/sysroot/usr/lib/libuuid.so.1 /data/local/tmp/coreclr/lib
+adb push cross/android/toolchain/arm64/sysroot/usr/lib/libintl.so /data/local/tmp/coreclr/lib
+adb push cross/android/toolchain/arm64/sysroot/usr/lib/libandroid-support.so /data/local/tmp/coreclr/lib/
+adb push cross/android/toolchain/arm64/sysroot/usr/lib/libandroid-glob.so /data/local/tmp/coreclr/lib/
+adb push src/pal/tests/palsuite/paltestlist.txt /data/local/tmp/coreclr
+adb push src/pal/tests/palsuite/runpaltests.sh /data/local/tmp/coreclr/
+```
+
+Then, use `adb shell` to launch a shell on Android. Inside that shell, you can launch the PAL tests:
+```
+LD_LIBRARY_PATH=/data/local/tmp/coreclr/lib ./runpaltests.sh /data/local/tmp/coreclr/
+```
+
+Debugging coreclr on Android
+----------------------------
+
+You can debug coreclr on Android using a remote lldb server which you run on your Android device.
+
+First, push the lldb server to Android:
+
+```
+adb push cross/android/lldb/2.2/android/arm64-v8a/lldb-server /data/local/tmp/
+```
+
+Then, launch the lldb server on the Android device. Open a shell using `adb shell` and run:
+
+```
+adb shell
+cd /data/local/tmp
+./lldb-server platform --listen "*:1234"
+```
+
+After that, you'll need to forward port 1234 from your Android device to your PC:
+```
+adb forward tcp:1234 tcp:1234
+```
+
+Finally, install lldb on your PC and connect to the debug server running on your Android device:
+
+```
+lldb-3.9
+(lldb) platform select remote-android
+ Platform: remote-android
+ Connected: no
+(lldb) platform connect connect://localhost:1234
+ Platform: remote-android
+ Triple: aarch64-*-linux-android
+OS Version: 23.0.0 (3.10.84-perf-gf38969a)
+ Kernel: #1 SMP PREEMPT Fri Sep 16 11:29:29 2016
+ Hostname: localhost
+ Connected: yes
+WorkingDir: /data/local/tmp
+
+(lldb) target create coreclr/pal/tests/palsuite/file_io/CopyFileA/test4/paltest_copyfilea_test4
+(lldb) env LD_LIBRARY_PATH=/data/local/tmp/coreclr/lib
+(lldb) run
+```
diff --git a/Documentation/building/cross-building.md b/Documentation/building/cross-building.md
index ab5897a..30c7aca 100644
--- a/Documentation/building/cross-building.md
+++ b/Documentation/building/cross-building.md
@@ -21,11 +21,12 @@ and conversely for arm64:
Generating the rootfs
---------------------
-The `cross\build-rootfs.sh` script can be used to download the files needed for cross compilation. It will generate an Ubuntu 14.04 rootfs as this is what CoreCLR targets.
+The `cross/build-rootfs.sh` script can be used to download the files needed for cross compilation. It will generate a rootfs for the Linux distribution that CoreCLR targets.
- Usage: build-rootfs.sh [BuildArch] [UbuntuCodeName]
- BuildArch can be: arm, arm-softfp, arm64
- UbuntuCodeName - optional, Code name for Ubuntu, can be: trusty(default), vivid, wily
+ Usage: ./cross/build-rootfs.sh [BuildArch] [LinuxCodeName] [lldbx.y] [--skipunmount]
+ BuildArch can be: arm(default), armel, arm64, x86
+ LinuxCodeName - optional, Code name for Linux, can be: trusty(default), vivid, wily, xenial. If BuildArch is armel, LinuxCodeName is jessie(default) or tizen.
+ lldbx.y - optional, LLDB version, can be: lldb3.6(default), lldb3.8
The `build-rootfs.sh` script must be run as root as it has to make some symlinks to the system, it will by default generate the rootfs in `cross\rootfs\<BuildArch>` however this can be changed by setting the `ROOTFS_DIR` environment variable.
@@ -33,7 +34,7 @@ For example, to generate an arm rootfs:
ben@ubuntu ~/git/coreclr/ $ sudo ./cross/build-rootfs.sh arm
-You can choose Ubuntu code name to match your target, give `vivid` for `15.04`, `wily` for `15.10`. Default is `trusty`, version `14.04`.
+You can choose a Linux code name to match your target: give `vivid` for `Ubuntu 15.04`, `wily` for `Ubuntu 15.10`. The default is `trusty`, for `Ubuntu 14.04`.
ben@ubuntu ~/git/coreclr/ $ sudo ./cross/build-rootfs.sh arm wily
@@ -41,6 +42,18 @@ and if you wanted to generate the rootfs elsewhere:
ben@ubuntu ~/git/coreclr/ $ sudo ROOTFS_DIR=/home/ben/coreclr-cross/arm ./cross/build-rootfs.sh arm
+For example, to generate an armel rootfs:
+
+ hqu@ubuntu ~/git/coreclr/ $ sudo ./cross/build-rootfs.sh armel
+
+You can choose a code name to match your target: give `jessie` for `Debian`, `tizen` for `Tizen`. The default is `jessie`.
+
+ hqu@ubuntu ~/git/coreclr/ $ sudo ./cross/build-rootfs.sh armel tizen
+
+and if you wanted to generate the rootfs elsewhere:
+
+ hqu@ubuntu ~/git/coreclr/ $ sudo ROOTFS_DIR=/home/ben/coreclr-cross/armel ./cross/build-rootfs.sh armel tizen
+
Cross compiling CoreCLR
-----------------------
@@ -117,7 +130,7 @@ prajwal@ubuntu ~/coreclr $ ./tests/scripts/arm32_ci_script.sh \
--skipTests
```
-The Linux ARM Emulator is based on soft floating point and thus the native binaries in coreclr are built for the arm-softfp architecture. The coreclr binaries generated by the above command (native and mscorlib) can be found at `~/coreclr/bin/Product/Linux.arm-softfp.Release`.
+The Linux ARM Emulator is based on soft floating point and thus the native binaries in coreclr are built for the armel architecture. The coreclr binaries generated by the above command (native and mscorlib) can be found at `~/coreclr/bin/Product/Linux.armel.Release`.
To build libcoreclr and mscorlib, and run selected coreclr unit tests on the emulator, do the following:
* Download the latest Coreclr unit test binaries (or build on Windows) from here: [Debug](http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/debug_windows_nt_bld/lastSuccessfulBuild/artifact/bin/tests/tests.zip) and [Release](http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/release_windows_nt_bld/lastSuccessfulBuild/artifact/bin/tests/tests.zip).
diff --git a/Documentation/building/testing-with-corefx.md b/Documentation/building/testing-with-corefx.md
index 4f9886f..defc8f8 100644
--- a/Documentation/building/testing-with-corefx.md
+++ b/Documentation/building/testing-with-corefx.md
@@ -3,18 +3,23 @@ Testing with CoreFX
It may be valuable to use CoreFX tests to validate your changes to CoreCLR or mscorlib.
-**Windows**
+**NOTE:** The `BUILDTOOLS_OVERRIDE_RUNTIME` property no longer works.
-As part of building tests, CoreFX restores a copy of the runtime from myget, in order to update the runtime that is deployed, a special build property `BUILDTOOLS_OVERRIDE_RUNTIME` can be used. If this is set, the CoreFX testing targets will copy all the files in the folder it points to into the test folder, overwriting any files that exist.
+**Replace runtime between build.[cmd|sh] and build-tests.[cmd|sh]**
-To run tests, follow the procedure for [running tests in CoreFX](https://github.com/dotnet/corefx/blob/master/Documentation/building/windows-instructions.md). You can pass `/p:BUILDTOOLS_OVERRIDE_RUNTIME=<path-to-coreclr>\bin\Product\Windows_NT.x64.Release` to build.cmd to set this property, e.g. (note the space between the "--" and the "/p" option):
+Use the following instructions to test a change to the dotnet/coreclr repo using dotnet/corefx tests. Refer to the [CoreFx Developer Guide](https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/developer-guide.md) for information about CoreFx build scripts.
-```
-build.cmd -Release -- /p:BUILDTOOLS_OVERRIDE_RUNTIME=<root of coreclr repo>\bin\Product\Windows_NT.x64.Checked
-```
+1. Build the CoreCLR runtime you wish to test under `<coreclr_root>`
+2. Build the CoreFx repo (`build.[cmd|sh]`) under `<corefx_root>`, but don't build tests yet
+3. Copy the contents of the CoreCLR binary root you wish to test into the CoreFx runtime folder (`<flavor>` below) created in step #2. For example:
-**FreeBSD, Linux, NetBSD, OS X**
+ `copy <coreclr_root>\bin\Product\Windows_NT.<arch>.<build_type>\* <corefx_root>\bin\runtime\<flavor>`
+ -or-
+ `cp <coreclr_root>/bin/Product/<os>.<arch>.<build_type>/* <corefx_root>/bin/runtime/<flavor>`
+
+4. Run the CoreFx `build-tests.[cmd|sh]` script as described in the Developer Guide.
+
+**CI Script**
+
+[run-corefx-tests.py](https://github.com/dotnet/coreclr/blob/master/tests/scripts/run-corefx-tests.py) will clone dotnet/corefx and run steps 2-4 above automatically. It is primarily intended to be run by the dotnet/coreclr CI system, but it might provide a useful reference or shortcut for individuals running the tests locally.
-Refer to the procedure for [running tests in CoreFX](https://github.com/dotnet/corefx/blob/master/Documentation/building/cross-platform-testing.md)
-- Note the --coreclr-bins and --mscorlib-bins arguments to [run-test.sh](https://github.com/dotnet/corefx/blob/master/run-test.sh)
-- Pass in paths to your private build of CoreCLR
diff --git a/Documentation/building/unix-test-instructions.md b/Documentation/building/unix-test-instructions.md
index 9cf7507..563c3e8 100644
--- a/Documentation/building/unix-test-instructions.md
+++ b/Documentation/building/unix-test-instructions.md
@@ -36,8 +36,7 @@ Run tests (`Debug` may be replaced with `Release` or `Checked`, depending on whi
> --testNativeBinDir=~/coreclr/bin/obj/Linux.x64.Debug/tests
> --coreClrBinDir=~/coreclr/bin/Product/Linux.x64.Debug
> --mscorlibDir=/media/coreclr/bin/Product/Linux.x64.Debug
-> --coreFxBinDir="~/corefx/bin/Linux.AnyCPU.Debug;~/corefx/bin/Unix.AnyCPU.Debug;~/corefx/bin/AnyOS.AnyCPU.Debug"
-> --coreFxNativeBinDir=~/corefx/bin/Linux.x64.Debug
+> --coreFxBinDir=~/corefx/bin/runtime/netcoreapp-Linux-Debug-x64
> ```
The method above will copy dependencies from the set of directories provided to create an 'overlay' directory.
diff --git a/Documentation/design-docs/finally-optimizations.md b/Documentation/design-docs/finally-optimizations.md
new file mode 100644
index 0000000..d35d5a4
--- /dev/null
+++ b/Documentation/design-docs/finally-optimizations.md
@@ -0,0 +1,487 @@
+Finally Optimizations
+=====================
+
+In MSIL, a try-finally is a construct where a block of code
+(the finally) is guaranteed to be executed after control leaves a
+protected region of code (the try) either normally or via an
+exception.
+
+In RyuJit a try-finally is currently implemented by transforming the
+finally into a local function that is invoked via jitted code at normal
+exits from the try block and is invoked via the runtime for exceptional
+exits from the try block.
+
+For x86 the local function is simply a part of the method and shares
+the same stack frame with the method. For other architectures the
+local function is promoted to a potentially separable "funclet"
+which is almost like a regular function with a prolog and epilog. A
+custom calling convention gives the funclet access to the parent stack
+frame.
+
+In this proposal we outline three optimizations for finallys: removing
+empty trys, removing empty finallys and finally cloning.
+
+Empty Finally Removal
+---------------------
+
+An empty finally is one that has no observable effect. These often
+arise from `foreach` or `using` constructs (which induce a
+try-finally) where the cleanup method called in the finally does
+nothing. Often, after inlining, the empty finally is readily apparent.
+
+For example, this snippet of C# code
+```C#
+static int Sum(List<int> x) {
+ int sum = 0;
+ foreach(int i in x) {
+ sum += i;
+ }
+ return sum;
+}
+```
+produces the following jitted code:
+```asm
+; Successfully inlined Enumerator[Int32][System.Int32]:Dispose():this
+; (1 IL bytes) (depth 1) [below ALWAYS_INLINE size]
+G_M60484_IG01:
+ 55 push rbp
+ 57 push rdi
+ 56 push rsi
+ 4883EC50 sub rsp, 80
+ 488D6C2460 lea rbp, [rsp+60H]
+ 488BF1 mov rsi, rcx
+ 488D7DD0 lea rdi, [rbp-30H]
+ B906000000 mov ecx, 6
+ 33C0 xor rax, rax
+ F3AB rep stosd
+ 488BCE mov rcx, rsi
+ 488965C0 mov qword ptr [rbp-40H], rsp
+
+G_M60484_IG02:
+ 33C0 xor eax, eax
+ 8945EC mov dword ptr [rbp-14H], eax
+ 8B01 mov eax, dword ptr [rcx]
+ 8B411C mov eax, dword ptr [rcx+28]
+ 33D2 xor edx, edx
+ 48894DD0 mov gword ptr [rbp-30H], rcx
+ 8955D8 mov dword ptr [rbp-28H], edx
+ 8945DC mov dword ptr [rbp-24H], eax
+ 8955E0 mov dword ptr [rbp-20H], edx
+
+G_M60484_IG03:
+ 488D4DD0 lea rcx, bword ptr [rbp-30H]
+ E89B35665B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
+ 85C0 test eax, eax
+ 7418 je SHORT G_M60484_IG05
+
+; Body of foreach loop
+
+G_M60484_IG04:
+ 8B4DE0 mov ecx, dword ptr [rbp-20H]
+ 8B45EC mov eax, dword ptr [rbp-14H]
+ 03C1 add eax, ecx
+ 8945EC mov dword ptr [rbp-14H], eax
+ 488D4DD0 lea rcx, bword ptr [rbp-30H]
+ E88335665B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
+ 85C0 test eax, eax
+ 75E8 jne SHORT G_M60484_IG04
+
+; Normal exit from the implicit try region created by `foreach`
+; Calls the finally to dispose of the iterator
+
+G_M60484_IG05:
+ 488BCC mov rcx, rsp
+ E80C000000 call G_M60484_IG09 // call to finally
+
+G_M60484_IG06:
+ 90 nop
+
+G_M60484_IG07:
+ 8B45EC mov eax, dword ptr [rbp-14H]
+
+G_M60484_IG08:
+ 488D65F0 lea rsp, [rbp-10H]
+ 5E pop rsi
+ 5F pop rdi
+ 5D pop rbp
+ C3 ret
+
+; Finally funclet. Note it simply sets up and then tears down a stack
+; frame. The dispose method was inlined and is empty.
+
+G_M60484_IG09:
+ 55 push rbp
+ 57 push rdi
+ 56 push rsi
+ 4883EC30 sub rsp, 48
+ 488B6920 mov rbp, qword ptr [rcx+32]
+ 48896C2420 mov qword ptr [rsp+20H], rbp
+ 488D6D60 lea rbp, [rbp+60H]
+
+G_M60484_IG10:
+ 4883C430 add rsp, 48
+ 5E pop rsi
+ 5F pop rdi
+ 5D pop rbp
+ C3 ret
+```
+
+In such cases the try-finally can be removed, leading to code like the following:
+```asm
+G_M60484_IG01:
+ 57 push rdi
+ 56 push rsi
+ 4883EC38 sub rsp, 56
+ 488BF1 mov rsi, rcx
+ 488D7C2420 lea rdi, [rsp+20H]
+ B906000000 mov ecx, 6
+ 33C0 xor rax, rax
+ F3AB rep stosd
+ 488BCE mov rcx, rsi
+
+G_M60484_IG02:
+ 33F6 xor esi, esi
+ 8B01 mov eax, dword ptr [rcx]
+ 8B411C mov eax, dword ptr [rcx+28]
+ 48894C2420 mov gword ptr [rsp+20H], rcx
+ 89742428 mov dword ptr [rsp+28H], esi
+ 8944242C mov dword ptr [rsp+2CH], eax
+ 89742430 mov dword ptr [rsp+30H], esi
+
+G_M60484_IG03:
+ 488D4C2420 lea rcx, bword ptr [rsp+20H]
+ E8A435685B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
+ 85C0 test eax, eax
+ 7414 je SHORT G_M60484_IG05
+
+G_M60484_IG04:
+ 8B4C2430 mov ecx, dword ptr [rsp+30H]
+ 03F1 add esi, ecx
+ 488D4C2420 lea rcx, bword ptr [rsp+20H]
+ E89035685B call Enumerator[Int32][System.Int32]:MoveNext():bool:this
+ 85C0 test eax, eax
+ 75EC jne SHORT G_M60484_IG04
+
+G_M60484_IG05:
+ 8BC6 mov eax, esi
+
+G_M60484_IG06:
+ 4883C438 add rsp, 56
+ 5E pop rsi
+ 5F pop rdi
+ C3 ret
+```
+
+Empty finally removal is unconditionally profitable: it should always
+reduce code size and improve code speed.
+
+Empty Try Removal
+---------------------
+
+If the try region of a try-finally is empty, and the jitted code will
+execute on a runtime that does not protect finally execution from
+thread abort, then the try-finally can be replaced with just the
+content of the finally.
+
+Empty trys with non-empty finallys often exist in code that must run
+under both thread-abort aware and non-thread-abort aware runtimes. In
+the former case the placement of cleanup code in the finally ensures
+that the cleanup code will execute fully. But if thread abort is not
+possible, the extra protection offered by the finally is not needed.
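
The transformation can be illustrated with a small Python sketch (function names are invented): absent thread abort, the empty-try form and the bare cleanup are observably equivalent.

```python
# Source-level shape of an empty try guarding a non-empty finally,
# and the equivalent code once the try-finally is removed.
def with_empty_try(log):
    try:
        pass                      # empty try region
    finally:
        log.append("cleanup")     # the real work lives in the finally

def after_removal(log):
    log.append("cleanup")         # just the content of the finally
```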
+
+Empty try removal looks for try-finallys where the try region does
+nothing except invoke the finally. There are currently two different
+EH implementation models, so the try screening has two cases:
+
+* callfinally thunks (x64/arm64): the try must be a single empty
+basic block that always jumps to a callfinally that is the first
+half of a callfinally/always pair;
+* non-callfinally thunks (x86/arm32): the try must be a
+callfinally/always pair where the first block is an empty callfinally.
+
+The screening then verifies that the callfinally identified above is
+the only callfinally for the try. No other callfinallys are expected
+because this try cannot have multiple leaves and its handler cannot be
+reached by nested exit paths.
+
+When the empty try is identified, the jit modifies the
+callfinally/always pair to branch to the handler, modifies the
+handler's return to branch directly to the continuation (the
+branch target of the second half of the callfinally/always pair),
+updates various status flags on the blocks, and then removes the
+try-finally region.
+
+Finally Cloning
+---------------
+
+Finally cloning is an optimization where the jit duplicates the code
+in the finally for one or more of the normal exit paths from the try,
+and has those exit points branch to the duplicated code directly,
+rather than calling the finally. This transformation allows for
+improved performance and optimization of the common case where the try
+completes without an exception.
+
+Finally cloning also allows hot/cold splitting of finally bodies: the
+cloned finally code covers the normal try exit paths (the hot cases)
+and can be placed in the main method region, and the original finally,
+now used largely or exclusively for exceptional cases (the cold cases),
+can be split off into the cold code region. Without cloning, RyuJit
+would always treat the finally as cold code.
+
+Finally cloning will increase code size, though often the size
+increase is mitigated somewhat by more compact code generation in the
+try body and streamlined invocation of the cloned finallys.
+
+Try-finally regions may have multiple normal exit points. For example
+the following `try` has two: one at the `return 3` and one at the try
+region end:
+
+```C#
+try {
+ if (p) return 3;
+ ...
+}
+finally {
+ ...
+}
+return 4;
+```
+
+Here the finally must be executed no matter how the try exits. So
+there are two normal exit paths from the try, both of which pass
+through the finally but then diverge. The fact that some try
+regions can have multiple exits opens the potential for substantial
+code growth from finally cloning, and so leads to a choice point in
+the implementation:
+
+* Share the clone along all exit paths
+* Share the clone along some exit paths
+* Clone along all exit paths
+* Clone along some exit paths
+* Only clone along one exit path
+* Only clone when there is one exit path
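
To make the earlier two-exit example concrete, here is a Python stand-in for the C# snippet (illustrative only): both normal exits run the finally exactly once before diverging.

```python
trace = []

def f(p):
    try:
        if p:
            return 3              # early exit from the try
        # ... rest of the try body ...
    finally:
        trace.append("finally")   # runs on every exit from the try
    return 4                      # end-of-try exit
```

Calling `f(True)` returns 3 and `f(False)` returns 4; each call records one pass through the finally.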
+
+The shared clone option must essentially recreate or simulate the
+local call mechanism for the finally, though likely somewhat more
+efficiently. Each exit point must designate where control should
+resume once the shared finally has finished. For instance the jit
+could introduce a new local per try-finally to determine where the
+cloned finally should resume, and enumerate the possibilities using a
+small integer. The end of the cloned finally would then use a switch
+to determine what code to execute next. This has the downside of
+introducing unrealizable paths into the control flow graph.
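
A minimal sketch of that shared-clone scheme, assuming a per-try continuation variable and invented names (this is not RyuJit's implementation):

```python
# Each normal exit records a small integer naming its continuation;
# the shared finally body runs once; a final "switch" resumes control.
EXIT_EARLY_RETURN, EXIT_FALL_THROUGH = 0, 1

def run(p, log):
    if p:
        cont = EXIT_EARLY_RETURN      # the `return 3` exit
    else:
        cont = EXIT_FALL_THROUGH      # the end-of-try exit
    log.append("finally")             # shared cloned finally body
    # Switch on the continuation to pick up where the exit left off.
    if cont == EXIT_EARLY_RETURN:
        return 3
    return 4
```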
+
+Cloning along all exit paths can potentially lead to large amounts of
+code growth.
+
+Cloning along some paths or only one path implies that some normal
+exit paths won't be as well optimized. Nonetheless cloning along one
+path was the choice made by JIT64 and the one we recommend for
+implementation. In particular we suggest only cloning along the end of
+try region exit path, so that any early exit will continue to invoke
+the funclet for finally cleanup (unless that exit happens to have the
+same post-finally continuation as the end try region exit, in which
+case it can simply jump to the cloned finally).
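+
+Applied to the example above, this strategy leaves the early `return 3`
+path invoking the finally funclet, while only the end-of-try path runs
+the clone. A sketch of the resulting shape (comments stand in for what
+the jit actually emits, since C# cannot express skipping a finally):
+
+```C#
+try {
+    if (p) return 3;   // early exit: still invokes the finally funclet
+    // ...
+}
+finally {
+    // ... funclet retained for the early and exceptional exits
+}
+// On the end-of-try path the jit branches to a clone of the finally
+// body placed here, then falls through to the continuation.
+// ... cloned finally body ...
+return 4;
+```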
+
+One can imagine adaptive strategies. The size of the finally can
+be roughly estimated and the number of clones needed for full cloning
+readily computed. Selective cloning can be based on profile
+feedback or other similar mechanisms for choosing the profitable
+cases.
+
+The current implementation will clone the finally and retarget the
+last (largest IL offset) leave in the try region to the clone. Any
+other leave that ultimately transfers control to the same post-finally
+offset will also be modified to jump to the clone.
+
+Empirical studies have shown that most finallys are small. Thus to
+avoid excessive code growth, a crude size estimate is formed by
+counting the number of statements in the blocks that make up the
+finally. Any finally larger than 15 statements is not cloned. In our
+study this disqualified about 0.5% of all finallys from cloning.
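+
+In pseudocode, the heuristic amounts to something like the following
+(`BasicBlock` and its members are illustrative names, not the actual
+jit data structures):
+
+```C#
+// Crude size estimate: count statements across the finally's blocks
+// and refuse to clone anything larger than 15 statements.
+int statements = 0;
+foreach (BasicBlock block in finallyBlocks)
+    statements += block.Statements.Count;
+bool shouldClone = statements <= 15;
+```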
+
+### EH Nesting Considerations
+
+Finally cloning is also more complicated when the finally encloses
+other EH regions, since the clone will introduce copies of all these
+regions. While it is possible to implement cloning in such cases we
+propose to defer for now.
+
+Finally cloning is also a bit more complicated if the finally is
+enclosed by another finally region, so we likewise propose deferring
+support for this. (Seems like a rare enough thing but maybe not too
+hard to handle -- though possibly not worth it if we're not going to
+support the enclosing case).
+
+### Control-Flow and Other Considerations
+
+If the try never exits normally, then the finally can only be invoked
+in exceptional cases. There is no benefit to cloning since the cloned
+finally would be unreachable. We can detect a subset of such cases
+because there will be no call finally blocks.
+
+JIT64 does not clone finallys that contain a switch. We propose to
+do likewise. (Initially I did not include this restriction but
+hit a failing test case where the finally contained a switch. Might be
+worth a deeper look, though such cases are presumably rare.)
+
+If the finally never exits normally, then we presume it is cold code,
+and so will not clone.
+
+If the finally is marked as run rarely, we will not clone.
+
+Implementation Proposal
+-----------------------
+
+We propose that empty finally removal and finally cloning be run back
+to back, spliced into the phase list just after fgInline and
+fgAddInternal, and just before implicit by-ref and struct
+promotion. We want to run these early before a lot of structural
+invariants regarding EH are put in place, and before most
+other optimization, but run them after inlining
+(so empty finallys can be more readily identified) and after the
+addition of implicit try-finallys created by the jit. Empty finallys
+may arise later because of optimization, but this seems relatively
+uncommon.
+
+We will remove empty finallys first, then clone.
+
+Neither optimization will run when the jit is generating debuggable
+code or operating in min opts mode.
+
+### Empty Finally Removal (Sketch)
+
+Skip over methods that have no EH, are compiled with min opts, or
+where the jit is generating debuggable code.
+
+Walk the handler table, looking for try-finally (we could also look
+for and remove try-faults with empty faults, but those are presumably
+rare).
+
+If the finally is a single block and contains only a `retfilter`
+statement, then:
+
+* Retarget the callfinally(s) to jump always to the continuation blocks.
+* Remove the paired jump always block(s) (note we expect all finally
+calls to be paired since the empty finally returns).
+* For funclet EH models with finally target bits, clear the finally
+target from the continuations.
+* For non-funclet EH models only, clear out the GT_END_LFIN statement
+in the finally continuations.
+* Remove the handler block.
+* Reparent all directly contained try blocks to the enclosing try region
+or to the method region if there is no enclosing try.
+* Remove the try-finally from the EH table via `fgRemoveEHTableEntry`.
+
+After the walk, if any empty finallys were removed, revalidate the
+integrity of the handler table.
+
+### Finally Cloning (Sketch)
+
+Skip over all methods if the runtime supports thread abort. More on
+this below.
+
+Skip over methods that have no EH, are compiled with min opts, or
+where the jit is generating debuggable code.
+
+Walk the handler table, looking for try-finally. If the finally is
+enclosed in a handler or encloses another handler, skip.
+
+Walk the finally body blocks. If any is BBJ_SWITCH, or if none
+is BBJ_EHFINALLYRET, skip cloning. If all blocks are RunRarely
+skip cloning. If the finally has more than 15 statements, skip
+cloning.
+
+Walk the try region from back to front (from largest to smallest IL
+offset). Find the last block in the try that invokes the finally. That
+will be the path that will invoke the clone.
+
+If the EH model requires callfinally thunks, and there are multiple
+thunks that invoke the finally, and the callfinally thunk along the
+clone path is not the first, move it to the front (this helps avoid
+extra jumps).
+
+Set the insertion point to just after the callfinally in the path (for
+thunk models) or the end of the try (for non-thunk models). Set up a
+block map. Clone the finally body using `fgNewBBinRegion` and
+`fgNewBBafter` to make the first and subsequent blocks, and
+`CloneBlockState` to fill in the block contents. Clear the handler
+region on the cloned blocks. Bail out if cloning fails. Mark the first
+and last cloned blocks with appropriate BBF flags. Patch up inter-clone
+branches and convert the returns into jumps to the continuation.
+
+Walk the callfinallys, retargeting the ones that return to the
+continuation so that they invoke the clone. Remove the paired always
+blocks. Clear the finally target bit and any GT_END_LFIN from the
+continuation.
+
+If all call finallys are converted, modify the region to be try/fault
+(internally EH_HANDLER_FAULT_WAS_FINALLY, so we can distinguish it
+later from "organic" try/faults). Otherwise leave it as a
+try/finally.
+
+Clear the catch type on the clone entry.
+
+### Thread Abort
+
+For runtimes that support thread abort (desktop), more work is
+required:
+
+* The cloned finally must be reported to the runtime. Likely this
+can trigger off of the BBF_CLONED_FINALLY_BEGIN/END flags.
+* The jit must maintain the integrity of the clone by not losing
+track of the blocks involved, and not allowing code to move into or
+out of the cloned region.
+
+Code Size Impact
+----------------
+
+Code size impact from finally cloning was measured for CoreCLR on
+Windows x64.
+
+```
+Total bytes of diff: 16158 (0.12 % of base)
+ diff is a regression.
+Total byte diff includes 0 bytes from reconciling methods
+ Base had 0 unique methods, 0 unique bytes
+ Diff had 0 unique methods, 0 unique bytes
+Top file regressions by size (bytes):
+ 3518 : Microsoft.CodeAnalysis.CSharp.dasm (0.16 % of base)
+ 1895 : System.Linq.Expressions.dasm (0.32 % of base)
+ 1626 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.07 % of base)
+ 1428 : System.Threading.Tasks.Parallel.dasm (4.66 % of base)
+ 1248 : System.Linq.Parallel.dasm (0.20 % of base)
+Top file improvements by size (bytes):
+ -4529 : System.Private.CoreLib.dasm (-0.14 % of base)
+ -975 : System.Reflection.Metadata.dasm (-0.28 % of base)
+ -239 : System.Private.Uri.dasm (-0.27 % of base)
+ -104 : System.Runtime.InteropServices.RuntimeInformation.dasm (-3.36 % of base)
+ -99 : System.Security.Cryptography.Encoding.dasm (-0.61 % of base)
+57 total files with size differences.
+Top method regessions by size (bytes):
+ 645 : System.Diagnostics.Process.dasm - System.Diagnostics.Process:StartCore(ref):bool:this
+ 454 : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.ExpressionBinder:AdjustCallArgumentsForParams(ref,ref,ref,ref,ref,byref):this
+ 447 : System.Threading.Tasks.Dataflow.dasm - System.Threading.Tasks.Dataflow.Internal.SpscTargetCore`1[__Canon][System.__Canon]:ProcessMessagesLoopCore():this
+ 421 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Symbols.ImplementsHelper:FindExplicitlyImplementedMember(ref,ref,ref,ref,ref,ref,byref):ref
+ 358 : System.Private.CoreLib.dasm - System.Threading.TimerQueueTimer:Change(int,int):bool:this
+Top method improvements by size (bytes):
+ -2512 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
+ -824 : Microsoft.CodeAnalysis.dasm - Microsoft.Cci.PeWriter:WriteHeaders(ref,ref,ref,ref,byref):this
+ -663 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
+ -627 : System.Private.CoreLib.dasm - System.Diagnostics.Tracing.ManifestBuilder:CreateManifestString():ref:this
+ -546 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long):int:this (67 methods)
+3014 total methods with size differences.
+```
+
+The largest growth is seen in `Process:StartCore`, which has 4
+try-finally constructs.
+
+Diffs generally show improved codegen in the try bodies with cloned
+finallys. However some of this improvement comes from more aggressive
+use of callee save registers, and this causes size inflation in the
+funclets (note finally cloning does not alter the number of
+funclets). So if funclet save/restore could be contained to registers
+used in the funclet, the size impact would be slightly smaller.
+
+There are also some instances where cloning relatively small finallys
+leads to large code size increases. xxx is one example.
diff --git a/Documentation/design-docs/tailcalls-with-helpers.md b/Documentation/design-docs/tailcalls-with-helpers.md
new file mode 100644
index 0000000..e23b51e
--- /dev/null
+++ b/Documentation/design-docs/tailcalls-with-helpers.md
@@ -0,0 +1,460 @@
+# The current way of handling tail-calls
+## Fast tail calls
+These are tail calls that are handled directly by the jitter and no runtime cooperation is needed. They are limited to cases where:
+* Return value and call target arguments are all either primitive types, reference types, or valuetypes with a single primitive-type or reference-type field
+* The aligned size of the call target's arguments is less than or equal to the aligned size of the caller's arguments
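+
+For example, under the criteria above (a sketch; exact eligibility is
+determined by the jitter):
+
+```C#
+static long Callee(long a, long b) => a + b;
+
+// Can be a fast tail call: primitive argument and return types, and the
+// callee's argument area is no larger than the caller's.
+static long Caller(long x, long y) => Callee(x, y);
+
+// Cannot be a fast tail call under the second rule: the callee needs
+// more argument space than the caller received, so the helper-based
+// mechanism is used instead.
+static long Grows(long x) => Callee(x, x);
+```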
+
+## Tail calls using a helper
+Tail calls in cases where we cannot perform the call in a simple way are implemented using a tail call helper. Here is a rough description of how it works:
+* For each tail call target, the jitter asks runtime to generate an assembler argument copying routine. This routine reads vararg list of arguments and places the arguments in their proper slots in the CONTEXT or on the stack. Together with the argument copying routine, the runtime also builds a list of offsets of references and byrefs for return value of reference type or structs returned in a hidden return buffer and for structs passed by ref. The gc layout data block is stored at the end of the argument copying thunk.
+* At the time of the tail call, the caller generates a vararg list of all arguments of the tail called function and then calls JIT_TailCall runtime function. It passes it the copying routine address, the target address and the vararg list of the arguments.
+* The JIT_TailCall then performs the following:
+ * It calls RtlVirtualUnwind twice to get the context of the caller of the caller of the tail call to simulate the effect of running epilog of the caller of the tail call and also its return.
+  * It prepares stack space for the callee's stack arguments, a helper explicit TailCallFrame, and a CONTEXT structure where the argument registers of the callee, the stack pointer and the target function address are set. If the tail call caller has enough space for the callee arguments and the TailCallFrame in its stack frame, that space is used directly for the callee arguments. Otherwise the stack arguments area is allocated at the top of the stack. This differs slightly when the tail call was made from another tail called function - the TailCallFrame already exists and so it is not recreated. The TailCallFrame also keeps a pointer to the list of gc reference offsets of the arguments and structure return buffer members. The stack walker during GC then uses that to ensure proper GC liveness of those references.
+ * It calls the copying routine to translate the arguments from the vararg list to the just reserved stack area and the context.
+  * If the stack arguments and TailCallFrame didn't fit into the caller's stack frame, this data is now moved to its final location.
+ * RtlRestoreContext is used to start executing the callee.
+
+There are several issues with this approach:
+* It is expensive to port to new platforms
+  * Parsing the vararg list is not possible to do in a portable way on Unix. Unlike on Windows, the list is not stored as a linear sequence of the parameter data bytes in memory; va_list on Unix is an opaque data type, and some of the parameters can be in registers and some in memory.
+  * Generating the copying asm routine needs to be done differently for each target architecture / platform. It is also very complex, error-prone, and impossible on platforms where code generation at runtime is not allowed.
+* It is slower than it has to be
+ * The parameters are copied possibly twice - once from the vararg list to the stack and then one more time if there was not enough space in the caller's stack frame.
+  * RtlRestoreContext restores all registers from the CONTEXT structure, not just the subset that is really necessary for the functionality, so it results in additional unnecessary memory accesses.
+* Stack walking over the stack frames of the tail calls requires runtime assistance.
+
+# The new approach to tail calls using helpers
+## Objectives
+The new way of handling tail calls using helpers was designed with the following objectives:
+* It should be cheap to port to new platforms, architectures and code generators
+* It needs to work in both jitted and AOT compiled scenarios
+* It should support platforms where runtime code generation is not possible
+* The tail calls should be reasonably fast compared to regular calls with the same arguments
+* The tail calls should not be slower than existing mechanism on Windows
+* No runtime assistance should be necessary for unwinding stack with tail call frames on it
+* The stack should be unwindable at any spot during the tail calls to properly support sampling profilers and similar tools.
+* Stack walk during GC must be able to always correctly report GC references.
+* It should work in all cases except those where a tail call is not allowed as described in the ECMA 335 standard section III.2.4
+
+## Requirements
+* The code generator needs to be able to compile a tail call to a target as a call to a thunk with the same parameters as the target but a void return, followed by a jump to an assembler helper.
+
+## Implementation
+This section describes the helper functions and data structures that the tail calls use and also describes the tail call sequence step by step.
+### Helper functions
+The tail calls use the following thunks and helpers:
+* StoreArguments - this thunk stores the arguments into a thread local storage together with the address of the corresponding CallTarget thunk and a descriptor of locations and types of managed references in the stored arguments data. This thunk is generated as IL and compiled by the jitter or AOT compiler. There is one such thunk per tail call target.
+Its signature is compatible with the tail call target's except that the return type is void. It is not identical, though: in addition to the target's arguments, it also receives the "this" pointer and the generic context as explicit arguments if the tail call target requires them. Arguments of generic reference types are passed as "object" so that the StoreArguments thunk doesn't have to be generic.
+* CallTarget - this thunk gets the arguments buffer that was filled by the StoreArguments thunk, loads the arguments from the buffer, releases the buffer and calls the target function using calli. The signature used for the calli would ensure that all arguments including the optional hidden return buffer and the generic context are passed in the right registers / stack slots. Generic reference arguments will be specified as "object" in the signature so that the CallTarget doesn't have to be generic.
+The CallTarget is also generated as IL and compiled by the jitter or AOT compiler. There is one such thunk per tail call target. This thunk has the same return type as the tailcall target, or returns "object" if the return type of the tail call target is a generic reference type.
+* TailCallHelper - this is an assembler helper that is responsible for restoring stack pointer to the location where it was when the first function in a tail call chain was entered and then jumping to the CallTarget thunk. This helper is common for all tail call targets.
+In the context of each tailcall invocation, the TailCallHelper will be handled by the jitter as if it had the same return type as the tail call target. That means that if the tail call target needs a hidden return buffer for returning structs, the pointer to this buffer will be passed to the TailCallHelper the same way as it would be passed to the tail call target. The TailCallHelper would then pass this hidden argument to the CallTarget helper.
+There will be two flavors of this helper, based on whether the tail call target needs a hidden return buffer or not:
+ * TailCallHelper
+ * TailCallHelper_RetBuf
+
+### Helper data structures
+The tail calls use the following data structures:
+* Thread local storage for arguments. It stores the arguments of a tail call for a short period of time between the StoreArguments and CallTarget calls.
+* Arguments GC descriptor - descriptor of locations and types of managed references in the arguments.
+* TailCallHelperStack - a per thread stack of helper entries that is used to determine whether a tail call is chained or not. Its entries are allocated as local variables in CallTarget thunks. Each entry contains:
+ * Stack pointer captured right before a call to a tail call target
+ * ChainCall flag indicating whether the CallTarget thunk should return after the call to the tail call target or whether it should execute its epilog and jump to TailCallHelper instead. The latter is used by the TailCallHelper to remove the stack frame of the CallTarget before making a tail call from a tail called function.
+ * Pointer to the next entry on the stack.
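+
+A TailCallHelperStack entry as described above could be sketched as
+follows (the `sp` and `chainCall` field names follow the IL example
+later in this document; the layout itself is illustrative):
+
+```C#
+struct TailCallHelperStackEntry
+{
+    public IntPtr sp;        // stack pointer captured right before the call
+    public bool chainCall;   // set by TailCallHelper to request frame removal
+    public IntPtr next;      // next entry on the per-thread stack
+}
+```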
+
+### Tail call sequence
+* The caller calls the StoreArguments thunk corresponding to the callee to store the pointer to the tail call target function, its arguments, their GC descriptors, optional "this" and generic context arguments and the corresponding CallTarget thunk address in a thread local storage.
+* The caller executes its epilog, restoring stack pointer and callee saved registers to their values when the caller was entered.
+* The caller jumps to the TailCallHelper. This function performs the following operations:
+ * Get the topmost TailCallHelperStack entry for the current thread.
+ * Check if the previous stack frame is a CallTarget thunk frame by comparing the stack pointer value stored in the TailCallHelperStack entry to the current CFA (call frame address). If it matches, it means that the previous stack frame belongs to a CallTarget thunk and so the tail call caller was also tail called.
+  * If the previous frame was a CallTarget thunk, its stack frame needs to be removed to ensure that the stack will not grow when tail calls are chained. Set the ChainCall flag in the TailCallHelperStack entry and return. That returns control to the CallTarget thunk, which checks the ChainCall flag and, since it is set, executes its epilog and jumps to the TailCallHelper again.
+ * If the previous frame was not a CallTarget thunk, get the address of the CallTarget thunk of the tailcall target from the arguments buffer and jump to it.
+* The CallTarget thunk function then does the following operations:
+ * Create local instance of TailCallHelperStack entry and store the current stack pointer value in it.
+ * Push the entry to the TailCallHelperStack of the current thread.
+ * Get the arguments buffer from the thread local storage, extract the regular arguments and the optional "this" and generic context arguments and the target function pointer. Release the buffer and call the target function. The frame of the CallTarget thunk ensures that the arguments of the target are GC protected until the target function returns or tail calls to another function.
+ * Pop the TailCallHelperStack entry from the TailCallHelperStack of the current thread.
+ * Check the ChainCall flag in the TailCallHelperStack entry. If it is set, run epilog and jump to the TailCallHelper.
+ * If the ChainCall flag is clear, it means that the last function in the tail call chain has returned. So return the return value of the target function.
+
+## Work that needs to be done to implement the new tail calls mechanism
+### JIT (compiler in the AOT scenario)
+* Modify compilation of tail calls with helper so that a tail call is compiled as a call to the StoreArguments thunk followed by the jump to the assembler TailCallHelper. In other words, the
+```
+tail. call/callvirt <method>
+ret
+```
+becomes
+```
+call/callvirt <StoreArguments thunk>
+tail. call <TailCallHelper>
+ret
+```
+### Runtime (compiler in the AOT scenario)
+* Add generation of the StoreArguments and CallTarget IL thunks to the runtime (compiler tool chain in the AOT scenario). As a possible optimization, in the AOT scenario the thunks can be generated by the compiler as native code directly, without the intermediate IL.
+* For the JIT scenario, add a new method to the JIT to EE interface to get the StoreArguments thunk method handle for a given target method and the TailCallHelper address.
+
+### Runtime in both scenarios
+* Add support for the arguments buffer, which means:
+ * Add functions to create, release and get the buffer for a thread
+ * Add support for GC scanning the arguments buffers.
+* Implement the TailCallHelper asm helper for all architectures
+### Debugging in both scenarios
+Ensure that the stepping in a debugger works correctly. In CoreCLR, the TailCallStubManager needs to be updated accordingly.
+
+## Example code
+
+```C#
+struct S
+{
+ public S(long p1, long p2, long p3)
+ {
+ s1 = p1; s2 = p2; s3 = p3;
+ }
+
+ public long s1, s2, s3;
+}
+
+struct T
+{
+ public T(S s)
+ {
+        t1 = s.s1; t2 = s.s2; t3 = s.s3; t4 = 4;
+ }
+ public long t1, t2, t3, t4;
+}
+
+struct U
+{
+ public U(T t)
+ {
+        u1 = t.t1; u2 = t.t2; u3 = t.t3; u4 = t.t4; u5 = 5;
+ }
+ public long u1, u2, u3, u4, u5;
+}
+
+int D(U u)
+{
+ int local;
+    Console.WriteLine("In D, U = [{0}, {1}, {2}, {3}, {4}]", u.u1, u.u2, u.u3, u.u4, u.u5);
+ return 1;
+}
+
+int C(T t)
+{
+ int local;
+ Console.WriteLine("In C");
+ U args = new U(t);
+ return tailcall D(args);
+}
+
+int B(S s)
+{
+ int local;
+ Console.WriteLine("In B");
+    T args = new T(s);
+ return tailcall C(args);
+}
+
+void A()
+{
+ S args = new S(1, 2, 3);
+ int result = B(args);
+ Console.WriteLine("Done, result = {0}\n", result);
+}
+```
+
+## Example code execution
+This section shows how stack evolves during the execution of the example code above. Execution starts at function A, but the details below start at the interesting point where the first tail call is about to be called.
+### B is about to tail call C
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+* Callee saved registers and locals of B
+```
+### Arguments of C are stored in the thread local buffer, now we are in the TailCallHelper
+The callee saved registers and locals of B are not on the stack anymore.
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+```
+
+### In CallTarget thunk for C, about to call C
+The thunk will now extract parameters for C from the thread local storage and call C.
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+* Callee saved registers and locals of CallTarget thunk for C
+```
+### C is about to tail call D
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+* Callee saved registers and locals of CallTarget thunk for C
+* Stack arguments of C
+* Return address of C
+* Callee saved registers and locals of C
+```
+### Arguments of D are stored in the thread local buffer, now we are in the TailCallHelper
+The callee saved registers and locals of C are not on the stack anymore.
+But we still have the return address of C, stack arguments of C and callee saved registers and locals of CallTarget thunk for C on the stack.
+We need to remove them as well to prevent stack growing.
+The TailCallHelper detects that the previous stack frame was the frame of the CallTarget thunk for C and so it sets the ChainCall flag in the topmost TailCallHelperStackEntry and returns to CallTarget thunk for C in order to let it cleanup its stack frame.
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+* Callee saved registers and locals of CallTarget thunk for C
+* Stack arguments of C
+* Return address of C
+```
+### Returned to CallTarget thunk for C with ChainCall flag in the TailCallHelperStackEntry **set**
+The thunk checks the ChainCall flag and since it is set, it runs its epilog and then jumps to the TailCallHelper.
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+* Callee saved registers and locals of CallTarget thunk for C
+```
+### Back in TailCallHelper
+Now the stack is back in the state where we have made the previous tail call. Since the previous stack frame was not a CallTarget thunk frame, we just jump to the CallTarget thunk for D.
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+```
+### In CallTarget thunk for D, about to call D
+The thunk will now extract parameters for D from the thread local storage and call D.
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+* Callee saved registers and locals of CallTarget thunk for D
+```
+### In D
+We are in the last function of the chain, so after it does its work, it returns to its CallTarget thunk.
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+* Callee saved registers and locals of CallTarget thunk for D
+* Stack arguments of D
+* Return address of D
+* Callee saved registers and locals of D
+```
+### Returned to CallTarget thunk for D with ChainCall flag in the TailCallHelperStackEntry **clear**
+The thunk checks the ChainCall flag and, since it is clear, it recognizes that we are now returning from the call chain, so it returns the result of D.
+```
+* Return address of A
+* Callee saved registers and locals of A
+* Stack arguments of B
+* Return address of B
+* Callee saved registers and locals of CallTarget thunk for D
+```
+### Returned to A
+We are back in A and we have the return value of the call chain.
+```
+* Return address of A
+* Callee saved registers and locals of A
+```
+
+## Example of thunks generated for a simple generic method
+
+```C#
+struct Point
+{
+    public Point(int x, int y, int z)
+    {
+        this.x = x; this.y = y; this.z = z;
+    }
+
+    public int x;
+    public int y;
+    public int z;
+}
+
+class Foo
+{
+ public Point Test<T>(int x, T t) where T : class
+ {
+        Console.WriteLine("T: {0}", typeof(T));
+ return new Point(x, x, x);
+ }
+}
+```
+
+For tail calling the Test function, the IL helpers could, for example, look as follows:
+
+```IL
+.method public static void StoreArgumentsTest(native int target, object thisObj, native int genericContext, int x, object t) cil managed
+{
+ .maxstack 4
+ .locals init (
+ [0] native int buffer,
+ )
+
+ call native int AllocateArgumentBuffer(56)
+ stloc.0
+
+ ldloc.0
+ ldc.i8 0x12345678 // pointer to the GC descriptor of the arguments buffer
+ stind.i
+ ldloc.0
+ sizeof native int
+ add
+ stloc.0
+
+ ldloc.0
+ ldftn Point CallTargetTest()
+ stind.i
+ ldloc.0
+ sizeof native int
+ add
+ stloc.0
+
+ ldloc.0
+ ldarg.1 // "thisObj"
+ stobj object
+ ldloc.0
+ sizeof object
+ add
+ stloc.0
+
+ ldloc.0
+ ldarg.2 // "genericContext"
+ stind.i
+ ldloc.0
+ sizeof native int
+ add
+ stloc.0
+
+ ldloc.0
+ ldarg.3 // "x"
+ stind.i4
+ ldloc.0
+ sizeof native int
+ add
+ stloc.0
+
+ ldloc.0
+ ldarg.4 // "t"
+ stobj object
+ ldloc.0
+ sizeof object
+ add
+ stloc.0
+
+ ldloc.0
+ ldarg.0 // "target"
+ stind.i
+
+ ret
+}
+
+```
+
+```IL
+.method public static Point CallTargetTest() cil managed
+{
+ .maxstack 4
+ .locals init (
+ [0] native int buffer,
+ [1] valuetype TailCallHelperStackEntry entry,
+ [2] Point result
+ )
+
+ // Initialize the TailCallHelperStackEntry
+ // chainCall = false
+ // sp = current sp
+
+ ldloca.s 1
+ ldc.i4.0
+ stfld bool TailCallHelperStackEntry::chainCall
+ ldloca.s 1
+ call native int GetCurrentSp()
+ stfld native int TailCallHelperStackEntry::sp
+
+ ldloca.s 1 // TailCallHelperStackEntry
+ call void PushHelperStackEntry(native int)
+
+ // Prepare arguments for the tail call target
+
+ call native int FetchArgumentBuffer()
+ sizeof native int
+ add // skip the pointer to the GC descriptor of the arguments buffer
+ sizeof native int
+ add // skip the address of the CallTargetTest in the buffer, it is used by the TailCallHelper only
+ stloc.0
+
+ ldloc.0
+ ldobj object // this
+ ldloc.0
+ sizeof object
+ add
+ stloc.0
+
+ ldloc.0
+ ldind.i // generic context
+ ldloc.0
+ sizeof native int
+ add
+ stloc.0
+
+ ldloc.0
+ ldind.i4 // int x
+ ldloc.0
+ sizeof native int
+ add
+ stloc.0
+
+ ldloc.0
+ ldobj object // T t
+ ldloc.0
+ sizeof object
+ add
+
+ ldobj native int // tailcall target
+
+ // The arguments buffer is not needed anymore
+ call void ReleaseArgumentBuffer()
+
+ .try
+ {
+ calli Point (object, native int, int32, object) // this, generic context, x, t
+ stloc.2
+ leave.s Done
+ }
+ finally
+ {
+ ldloca.s 1 // TailCallHelperStackEntry
+ call void PopHelperStackEntry(native int)
+ endfinally
+ }
+
+Done:
+ ldloc.1 // TailCallHelperStackEntry
+ ldfld bool TailCallHelperStackEntry::chainCall
+ brfalse.s NotChained
+
+ // Jump to the TailCallHelper that will call to the next tail call in the chain.
+ // The stack frame of the current CallTargetTest is reclaimed and epilog executed
+ // before the TailCallHelper is entered.
+    tail. call Point TailCallHelper()
+ ret
+
+NotChained:
+ // Now we are returning from a chain of tail calls to the caller of this chain
+ ldloc.2
+ ret
+}
+``` \ No newline at end of file
diff --git a/Documentation/project-docs/contributing-workflow.md b/Documentation/project-docs/contributing-workflow.md
index 5c91a49..70423e5 100644
--- a/Documentation/project-docs/contributing-workflow.md
+++ b/Documentation/project-docs/contributing-workflow.md
@@ -85,7 +85,7 @@ We use and recommend the following workflow:
1. Create an issue for your work.
- You can skip this step for trivial changes.
- Reuse an existing issue on the topic, if there is one.
- - Use [CODE_OWNERS.TXT](https://github.com/dotnet/coreclr/blob/CODE_OWNERS.TXT) to find relevant maintainers and @ mention them to ask for feedback on your issue.
+ - Use [CODE_OWNERS.TXT](https://github.com/dotnet/coreclr/blob/master/CODE_OWNERS.TXT) to find relevant maintainers and @ mention them to ask for feedback on your issue.
- Get agreement from the team and the community that your proposed change is a good one.
- If your change adds a new API, follow the [API Review Process](https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md).
- Clearly state that you are going to take on implementing it, if that's the case. You can request that the issue be assigned to you. Note: The issue filer and the implementer don't have to be the same person.
diff --git a/Documentation/project-docs/glossary.md b/Documentation/project-docs/glossary.md
index 670254d..f4ae1c5 100644
--- a/Documentation/project-docs/glossary.md
+++ b/Documentation/project-docs/glossary.md
@@ -5,6 +5,7 @@ This glossary defines terms, both common and more niche, that are important to u
As much as possible, we should link to the most authoritative and recent source of information for a term. That approach should be the most helpful for people who want to learn more about a topic.
+* BBT: An early, Microsoft-internal version of C/C++ PGO. See https://www.microsoft.com/windows/cse/bit_projects.mspx.
* BOTR: Book Of The Runtime.
* CLR: Common Language Runtime.
* COMPlus: An early name for the .NET platform, back when it was envisioned as a successor to the COM platform (hence, "COM+"). Used in various places in the CLR infrastructure, most prominently as a common prefix for the names of internal configuration settings. Note that this is different from the product that eventually ended up being named [COM+](https://msdn.microsoft.com/en-us/library/windows/desktop/ms685978.aspx).
@@ -12,19 +13,25 @@ As much as possible, we should link to the most authoritative and recent source
* DAC: Data Access Component. An abstraction layer over the internal structures in the runtime.
* EE: Execution Engine.
* GC: [Garbage Collector](https://github.com/dotnet/coreclr/blob/master/Documentation/botr/garbage-collection.md).
+* IPC: Inter-Process Communication
* JIT: [Just-in-Time](https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md) compiler. RyuJIT is the code name for the next generation Just-in-Time (aka "JIT") compiler for the .NET runtime.
* LCG: Lightweight Code Generation. An early name for [dynamic methods](https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Reflection/Emit/DynamicMethod.cs).
+* MD: MetaData
* NGen: Native Image Generator.
* NYI: Not Yet Implemented
* PAL: [Platform Adaptation Layer](http://archive.oreilly.com/pub/a/dotnet/2002/03/04/rotor.html). Provides an abstraction layer between the runtime and the operating system
* PE: Portable Executable.
+* PGO: Profile Guided Optimization - see [details](https://blogs.msdn.microsoft.com/vcblog/2008/11/12/pogo/)
+* POGO: Profile Guided Optimization - see [details](https://blogs.msdn.microsoft.com/vcblog/2008/11/12/pogo/)
* ProjectN: Codename for the first version of [.NET Native for UWP](https://msdn.microsoft.com/en-us/vstudio/dotnetnative.aspx).
* ReadyToRun: A flavor of native images - command line switch of [crossgen](../building/crossgen.md).
* Redhawk: Codename for experimental minimal managed code runtime that evolved into [CoreRT](https://github.com/dotnet/corert/).
* SOS: [Son of Strike](http://blogs.msdn.com/b/jasonz/archive/2003/10/21/53581.aspx). The debugging extension for DbgEng based debuggers. Uses the DAC as an abstraction layer for its operation.
+* SuperPMI: JIT component test framework (super fast JIT testing - it mocks/replays EE in EE-JIT interface) - see [SuperPMI details](https://github.com/dotnet/coreclr/blob/master/src/ToolBox/superpmi/readme.txt).
* SVR: The CLR used to be built as two variants, with one called "mscorsvr.dll", to mean the "server" version. In particular, it contained the server GC implementation, which was intended for multi-threaded apps capable of taking advantage of multiple processors. In the .NET Framework 2 release, the two variants were merged into "mscorwks.dll". The WKS version was the default, however the SVR version remained available.
* TPA: Trusted Platform Assemblies used to be a special set of assemblies that comprised the platform assemblies, when it was originally designed. As of today, it is simply the set of assemblies known to constitute the application.
* URT: Universal Runtime. Ancient name for what ended up being .NET, is used in the WinError facility name FACILITY_URT.
* VSD: [Virtual Stub Dispatch](../botr/virtual-stub-dispatch.md). Technique of using stubs for virtual method invocations instead of the traditional virtual method table.
* VM: Virtual machine.
* WKS: The CLR used to be built as two variants, with one called "mscorwks.dll", to mean the "workstation" version. In particular, it contained the client GC implementation, which was intended for single-threaded apps, independent of how many processors were on the machine. In the .NET Framework 2 release, the two variants were merged into "mscorwks.dll". The WKS version was the default, however the SVR version remained available.
+* ZAP: Original code name for NGen
diff --git a/Documentation/workflow/IssuesFeedbackEngagement.md b/Documentation/workflow/IssuesFeedbackEngagement.md
index a6c0550..f83b2b6 100644
--- a/Documentation/workflow/IssuesFeedbackEngagement.md
+++ b/Documentation/workflow/IssuesFeedbackEngagement.md
@@ -3,10 +3,14 @@
## Reporting Problems (Bugs)
-We track bugs, feature requests and other issues [in this repo](https://github.com/dotnet/coreclr/issues).
-If you have a problem and believe that the issue is in the native runtime you should log it there. If in the managed code log it in the [CoreFX repo](https://github.com/dotnet/corefx/issues) _even if the code is in this CoreCLR repo_ (ie., in mscorlib/System.Private.Corelib). The reason for this is we sometimes move managed types between the two and it makes sense to keep all the issues together.
-
-Before you log a new issue, you should try using the search tool on the issue page on a few keywords to see if the issue was already logged.
+We track bugs, feature requests and other issues in the repository where they will ultimately get fixed. If you have a
+problem and believe that the issue is in CoreCLR itself (the native runtime or the System.Private.CoreLib base class
+library), you should log it on the [CoreCLR Issues Page](https://github.com/dotnet/coreclr/issues). If it is in the
+upper levels of the class library, use the [CoreFX Issues Page](https://github.com/dotnet/corefx/issues). For all
+managed API addition proposals, use the [CoreFX Issues Page](https://github.com/dotnet/corefx/issues) and follow
+the [API Review Process](https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md).
+
+Before you log a new issue, you should try using the search tool on the issue page on a few keywords to see if the issue was already logged.
### NET Forums
If you want to ask a question, or want wider discussion (to see if others share your issue), we encourage you to start a thread
diff --git a/Documentation/workflow/OfficalAndDailyBuilds.md b/Documentation/workflow/OfficalAndDailyBuilds.md
index 5a36af6..d5efb93 100644
--- a/Documentation/workflow/OfficalAndDailyBuilds.md
+++ b/Documentation/workflow/OfficalAndDailyBuilds.md
@@ -1,7 +1,7 @@
# Official Releases and Daily Builds of CoreCLR and CoreFX components
If you are not planning on actually making bug fixes or experimenting with new features, then you probably
-don't need to don't need build CoreCLR yourself, as the .NET Runtime team routinely does this for you.
+don't need to build CoreCLR yourself, as the .NET Runtime team routinely does this for you.
Roughly every three months, the .NET Runtime team publishes a new version of .NET Core to Nuget. .NET Core's
official home on NuGet is