Diffstat (limited to 'Documentation/botr')
-rw-r--r--  Documentation/botr/_tableOfContents.md       34
-rw-r--r--  Documentation/botr/botr-faq.md               46
-rw-r--r--  Documentation/botr/clr-abi.md               661
-rw-r--r--  Documentation/botr/dac-notes.md             213
-rw-r--r--  Documentation/botr/exceptions.md            299
-rw-r--r--  Documentation/botr/garbage-collection.md    332
-rw-r--r--  Documentation/botr/intro-to-clr.md          261
-rw-r--r--  Documentation/botr/method-descriptor.md     343
-rw-r--r--  Documentation/botr/mscorlib.md              357
-rw-r--r--  Documentation/botr/porting-ryujit.md        112
-rw-r--r--  Documentation/botr/profilability.md         240
-rw-r--r--  Documentation/botr/profiling.md             513
-rw-r--r--  Documentation/botr/readytorun-overview.md   335
-rw-r--r--  Documentation/botr/ryujit-overview.md       558
-rw-r--r--  Documentation/botr/stackwalking.md           85
-rw-r--r--  Documentation/botr/threading.md             210
-rw-r--r--  Documentation/botr/type-loader.md           317
-rw-r--r--  Documentation/botr/type-system.md           233
-rw-r--r--  Documentation/botr/virtual-stub-dispatch.md 188
19 files changed, 5337 insertions, 0 deletions
diff --git a/Documentation/botr/_tableOfContents.md b/Documentation/botr/_tableOfContents.md
new file mode 100644
index 0000000000..db4ffc121c
--- /dev/null
+++ b/Documentation/botr/_tableOfContents.md
@@ -0,0 +1,34 @@
+
+# The Book of the Runtime
+
+Welcome to the Book of the Runtime (BOTR) for the .NET Runtime. This contains
+a collection of articles about the non-trivial internals of the .NET Runtime. Its
+intended audience is people actually modifying the code or simply wishing to have a
+deep understanding of the runtime.
+
+Below is a table of contents.
+
+- [Book of the Runtime FAQ](botr-faq.md)
+- [Introduction to the Common Language Runtime](intro-to-clr.md)
+- [Garbage Collection Design](garbage-collection.md)
+- [Threading](threading.md)
+- [RyuJIT Overview](ryujit-overview.md)
+ - [Porting RyuJIT to other platforms](porting-ryujit.md)
+- [Type System](type-system.md)
+- [Type Loader](type-loader.md)
+- [Method Descriptor](method-descriptor.md)
+- [Virtual Stub Dispatch](virtual-stub-dispatch.md)
+- [Stack Walking](stackwalking.md)
+- [Mscorlib and Calling Into the Runtime](mscorlib.md)
+- [Data Access Component (DAC) Notes](dac-notes.md)
+- [Profiling](profiling.md)
+- [Implementing Profilability](profilability.md)
+- [What Every Dev needs to Know About Exceptions in the Runtime](exceptions.md)
+- [ReadyToRun Overview](readytorun-overview.md)
+- [CLR ABI](clr-abi.md)
+
+
+This table may not be complete. You can get a complete list
+by looking at the directory where all the chapters are stored:
+
+* [All Book of the Runtime (BOTR) chapters on GitHub](../botr)
diff --git a/Documentation/botr/botr-faq.md b/Documentation/botr/botr-faq.md
new file mode 100644
index 0000000000..b1a1de4c13
--- /dev/null
+++ b/Documentation/botr/botr-faq.md
@@ -0,0 +1,46 @@
+Book of the Runtime (BotR) FAQ
+===
+
+# What is the BotR?
+
+The [Book of the Runtime](https://github.com/dotnet/coreclr#learn-about-coreclr) is a set of documents that describe components in the CLR and BCL. They are intended to focus more on architecture and invariants and not an annotated description of the codebase.
+
+It was originally created within Microsoft around 2007, including this document. Developers were responsible for documenting their feature areas. This helped new devs joining the team and also helped share the product architecture across the team.
+
+We realized that the BotR is even more valuable now, with CoreCLR being open source on GitHub. We are publishing BotR chapters to help a new set of CLR developers.
+
+Each of the BotR documents was written with a [certain perspective](https://github.com/dotnet/coreclr/pull/115), both in terms of the timeframe and the author. We did not think it was right to mutate the documents to make them more "2015". They remain the docs that they were, modulo a few spelling corrections and a conversion to markdown. That said, we'll accept PRs to the docs to improve them.
+
+# Who is the main audience of BotR?
+
+- Developers who are working on bugs that impinge on an area and need a high level overview of the component.
+- Developers working on new features with dependencies on a component need to know enough about it to ensure the new feature will interact correctly with existing components.
+- New developers who need to come up to speed on a given component in order to maintain it.
+
+# What should be in a BotR chapter?
+
+The purpose of Book of the Runtime chapters is to capture information that we cannot easily reconstruct from the functional specification and source code alone, and to enable communication at a high level between team members. It explains concepts and presents a top-down description, and most importantly, explains why we made the design decisions we made.
+
+# How is this different from a design doc?
+
+A design doc is what you write before you start implementation. A BotR chapter is usually written after a feature is implemented, at which point you have already decided the pros and cons of various design options and settled on one (and perhaps have plans to use an improved design in the future), and have a much better idea about all the details (some of which could be very hard to think of without actually going through the implementation/testing). So you can talk about rationales behind design decisions a lot better.
+
+# I am a new dev and not familiar with any features yet, how can I contribute?
+
+A new dev can be a great contributor to BotR as one of the most important purposes of BotR is to help new devs with getting up to speed. Here are some ways you can contribute:
+
+- Be a reviewer! If you think some things are not clear or could be explained better, do not hesitate to contact the author of the chapter and chat with him/her to see how you can make it more understandable.
+- As you are getting up to speed in your area, look over the BotR chapters for your area and see if there are any errors or anything that requires an update and make the modifications yourself.
+- Volunteer to write a chapter or part of a chapter. This might seem like a daunting task but you can start by just accumulating knowledge - take notes as you learn stuff about your area and gradually mold it into a BotR chapter.
+
+# What are the responsibilities of a BotR reviewer?
+
+As a reviewer you will be expected to give constructive comments on the chapter you are reviewing. You can comment on any aspect, e.g., the technical depth, writing style, or content coverage. Keep in mind that BotR is mostly about design and architectural issues that may not be obvious; it is not meant to walk the reader through implementation details. Please focus your review accordingly.
+
+# I _really_ don't have time to work on a BotR chapter – it seems like I always have other things to do. What do I do?
+
+Here are some approaches I think are useful when working on BotR chapters.
+
+- Spread the work out; don't make it a work item as in "I will need to spend the next Monday through Thursday to work on my chapter"; think of it more like something you do when you want to take a break from coding or bug fixing, or just a change of scenery. I find it much easier to spend a little time here and there working on a chapter than having to specifically allocate a contiguous number of days which always seem hard to come by.
+- Have someone else write the chapter or most of the chapter for you. I am not joking. This is actually a very good way to help new devs ramp up. If you will be mentoring a new dev in your area, spend time with them to explain the feature area and encourage them to write a BotR chapter if one doesn't already exist. Of course be a reviewer of it.
+- Use other documentation that is already there. There are MSDN docs and blog posts on .NET features. This can certainly be a base for your BotR chapter as well.
diff --git a/Documentation/botr/clr-abi.md b/Documentation/botr/clr-abi.md
new file mode 100644
index 0000000000..cbd5fc903c
--- /dev/null
+++ b/Documentation/botr/clr-abi.md
@@ -0,0 +1,661 @@
+# CLR ABI
+
+This document describes the .NET Common Language Runtime (CLR) software conventions (or ABI, "Application Binary Interface"). It focuses on the ABI for the x64 (aka AMD64), ARM (aka ARM32 or Thumb-2), and ARM64 processor architectures. Documentation for the x86 ABI is somewhat scant.
+
+It describes requirements that the Just-In-Time (JIT) compiler imposes on the VM and vice-versa.
+
+A note on the JIT codebases: JIT32 refers to the original JIT codebase, which generated x86 code and was later ported to generate ARM code. Later, it was ported and re-architected to generate AMD64 code (making its name something of a confusing misnomer). This work is referred to as RyuJIT. RyuJIT is being ported to generate ARM64 code. JIT64 refers to the legacy codebase that supports AMD64.
+
+# Getting started
+
+Read everything in the Windows ABI documentation:
+
+AMD64: See "x64 Software Conventions" on MSDN: https://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx.
+
+ARM: See "Overview of ARM ABI Conventions" on MSDN: https://msdn.microsoft.com/en-us/library/dn736986.aspx.
+
+The CLR follows those basic conventions. This document only describes things that are CLR-specific, or exceptions from those documents.
+
+# General Unwind/Frame Layout
+
+For all non-x86 platforms, all methods must have unwind information so the garbage collector (GC) can unwind them (unlike native code, in which unwind information for a leaf method may be omitted).
+
+ARM and ARM64: Managed methods must always push LR on the stack, and create a minimal frame, so that the method can be properly hijacked using return address hijacking.
+
+# Special/extra parameters
+
+## The "this" pointer
+
+The managed "this" pointer is treated like a new kind of argument not covered by the native ABI, so we chose to always pass it as the first argument in (AMD64) `RCX` or (ARM, ARM64) `R0`.
+
+AMD64-only: Up to .NET Framework 4.5, the managed "this" pointer was treated just like the native "this" pointer (meaning it was the second argument when the call used a return buffer and was passed in RDX instead of RCX). Starting with .NET Framework 4.5, it is always the first argument.
+
+## Varargs
+
+Varargs refers to passing or receiving a variable number of arguments for a call.
+
+C# varargs, using the `params` keyword, are at the IL level just normal calls with a fixed number of parameters.
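+
+As a hedged illustration (the method name is arbitrary), the `params` call below compiles to an ordinary call whose single argument is an array, so none of the vararg machinery described next is involved:
+
+```
+static int Sum(params int[] values)     // one fixed parameter at the IL level
+{
+    int total = 0;
+    foreach (int v in values)
+        total += v;
+    return total;
+}
+
+int s = Sum(1, 2, 3);                   // equivalent to Sum(new int[] { 1, 2, 3 })
+```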
+
+Managed varargs (using C#'s pseudo-documented "...", `__arglist`, etc.) are implemented almost exactly like C++ varargs. The biggest difference is that the JIT adds a "vararg cookie" after the optional return buffer and the optional "this" pointer, but before any other user arguments. The callee must spill this cookie and all subsequent arguments into their home location, as they may be addressed via pointer arithmetic starting with the cookie as a base. The cookie happens to be a pointer to a signature that the runtime can parse to (1) report any GC pointers within the variable portion of the arguments or (2) type-check (and properly walk over) any arguments extracted via ArgIterator. This is marked by `IMAGE_CEE_CS_CALLCONV_VARARG`, which should not be confused with `IMAGE_CEE_CS_CALLCONV_NATIVEVARARG`, which really is exactly native varargs (no cookie) and should only appear in PInvoke IL stubs, which properly handle pinning and other GC magic.
+
+On AMD64, just like native, any floating point arguments passed in floating point registers (including the fixed arguments) will be shadowed (i.e. duplicated) in the integer registers.
+
+On ARM and ARM64, just like native, nothing is put in the floating point registers.
+
+However, unlike native varargs, floating point arguments are not promoted to double (`R8`); instead they retain their original type (`R4` or `R8`) (although this does not preclude an IL generator like managed C++ from explicitly injecting an upcast at the call-site and adjusting the call-site-sig appropriately). This leads to unexpected behavior when native C++ is ported to C# or even just managed via the different flavors of managed C++.
+
+Managed varargs are not supported in .NET Core.
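+
+For illustration (desktop framework only, per the note above), here is a sketch of a managed varargs method that walks its arguments with `ArgIterator`, which parses the signature referenced by the vararg cookie:
+
+```
+static void PrintAll(__arglist)
+{
+    ArgIterator it = new ArgIterator(__arglist);
+    while (it.GetRemainingCount() > 0)
+    {
+        TypedReference tr = it.GetNextArg();             // typed via the cookie's signature
+        Console.WriteLine(TypedReference.ToObject(tr));
+    }
+}
+
+// Call site: note the float argument stays R4 rather than being promoted to double.
+PrintAll(__arglist(1, 2.0f, "three"));
+```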
+
+## Generics
+
+*Shared generics*. In cases where the code address does not uniquely identify a generic instantiation of a method, then a 'generic instantiation parameter' is required. Often the "this" pointer can serve dual-purpose as the instantiation parameter. When the "this" pointer is not the generic parameter, the generic parameter is passed as the next argument (after the optional return buffer and the optional "this" pointer, but before any user arguments). For generic methods (where there is a type parameter directly on the method, as compared to the type), the generic parameter currently is a MethodDesc pointer (I believe an InstantiatedMethodDesc). For static methods (where there is no "this" pointer) the generic parameter is a MethodTable pointer/TypeHandle.
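+
+A hedged C# sketch of which form the hidden instantiation parameter takes for shared (reference-type) instantiations; the type and method names are arbitrary:
+
+```
+class C<T>
+{
+    public void InstanceMethod() { }        // no extra parameter: "this" identifies C<T>
+    public static void StaticMethod() { }   // hidden MethodTable/TypeHandle for C<T>
+    public void GenericMethod<U>() { }      // hidden MethodDesc for GenericMethod<U>
+}
+```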
+
+Sometimes the VM asks the JIT to report and keep alive the generics parameter. In this case, it must be saved on the stack someplace and kept alive via normal GC reporting (if it was the "this" pointer, as compared to a MethodDesc or MethodTable) for the entire method except the prolog and epilog. Also note that the code to home it must be in the range of code reported as the prolog in the GC info (which probably isn't the same as the range of code reported as the prolog in the unwind info).
+
+There is no defined/enforced/declared ordering between the generic parameter and the varargs cookie because the runtime does not support that combination. There are chunks of code in the VM and JITs that would appear to support that, but other places assert and disallow it, so nothing is tested, and I would assume there are bugs and differences (i.e. one JIT using a different ordering than the other JIT or the VM).
+
+### Example
+```
+call(["this" pointer] [return buffer pointer] [generics context|varargs cookie] [userargs]*)
+```
+
+## AMD64-only: by-value value types
+
+Just like native, AMD64 has implicit-byrefs. Any structure (value type in IL parlance) that is not 1, 2, 4, or 8 bytes in size (i.e., 3, 5, 6, 7, or >= 9 bytes in size) that is declared to be passed by value, is instead passed by reference. For JIT generated code, it follows the native ABI where the passed-in reference is a pointer to a compiler generated temp local on the stack. However, there are some cases within remoting or reflection where apparently stackalloc is too hard, and so they pass in pointers within the GC heap, thus the JITed code must report these implicit byref parameters as interior pointers (BYREFs in JIT parlance), in case the callee is one of these reflection paths. Similarly, all writes must use checked write barriers.
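+
+A small sketch of which sizes are affected (the example types are arbitrary):
+
+```
+struct S8  { public long A; }           // 8 bytes:  passed by value
+struct S12 { public int A, B, C; }      // 12 bytes: passed by implicit byref
+
+static long Use(S8 x, S12 y)            // the callee receives y as a pointer to a copy
+{
+    return x.A + y.A + y.B + y.C;
+}
+```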
+
+The AMD64 native calling conventions (Windows 64 and System V) require the return buffer address to be returned by the callee in RAX. The JIT also follows this rule.
+
+## Return buffers
+
+The same applies to some return buffers. See `MethodTable::IsStructRequiringStackAllocRetBuf()`. When that returns false, the return buffer might be on the heap, either due to the reflection/remoting code paths mentioned previously or due to a JIT optimization where a call with a return buffer whose result is then assigned to a field (on the GC heap) is changed into passing the field reference as the return buffer. Conversely, when it returns true, the JIT does not need to use a write barrier when storing to the return buffer, but it is still not guaranteed to be a compiler temp, and as such the JIT should not introduce spurious writes to the return buffer.
+
+NOTE: This optimization is now disabled for all platforms (`IsStructRequiringStackAllocRetBuf()` always returns FALSE).
+
+## Hidden parameters
+
+*Stub dispatch* - when a virtual call uses a VSD stub, rather than back-patching the calling code (or disassembling it), the JIT must place the address of the stub used to load the call target, the "stub indirection cell", in (x86) `EAX` / (AMD64) `R11` / (ARM) `R4` / (ARM64) `R11`. In the JIT, this is `REG_VIRTUAL_STUB_PARAM`.
+
+*Fast Pinvoke* - AMD64-only: The VM wants a conservative estimate of the size of the stack arguments placed in `R11`. (This is consumed by callout stubs used in SQL hosting.)
+
+*Calli Pinvoke* - The VM wants the address of the PInvoke in (AMD64) `R10` / (ARM) `R12` / (ARM64) `R14` (In the JIT: `REG_PINVOKE_TARGET_PARAM`), and the signature (the pinvoke cookie) in (AMD64) `R11` / (ARM) `R4` / (ARM64) `R15` (in the JIT: `REG_PINVOKE_COOKIE_PARAM`).
+
+*Normal PInvoke* - The VM shares IL stubs based on signatures, but wants the right method to show up in call stacks and exceptions, so the MethodDesc for the exact PInvoke is passed in (x86) `EAX` / (AMD64) `R10` / (ARM, ARM64) `R12` (in the JIT: `REG_SECRET_STUB_PARAM`). Then in the IL stub, when the JIT gets `CORJIT_FLG_PUBLISH_SECRET_PARAM`, it must move the register into a compiler temp. The value is returned for the intrinsic `CORINFO_INTRINSIC_StubHelpers_GetStubContext`, and the address of that location is returned for `CORINFO_INTRINSIC_StubHelpers_GetStubContextAddr`.
+
+# PInvokes
+
+The convention is that any method with an InlinedCallFrame (either an IL stub or a normal method with an inlined pinvoke) saves/restores all non-volatile integer registers in its prolog/epilog respectively. This is done so that the InlinedCallFrame can just contain a return address, a stack pointer and a frame pointer. Then using just those three it can start a full stack walk using the normal RtlVirtualUnwind.
+
+For AMD64, a method with an InlinedCallFrame must use RBP as the frame register.
+
+For ARM and ARM64, we will also always use a frame pointer (R11). That is partially due to the frame chaining requirement. However, the VM also requires it for PInvokes with InlinedCallFrames.
+
+For ARM, the VM also has a dependency on `REG_SAVED_LOCALLOC_SP`.
+
+All these dependencies show up in the implementation of `InlinedCallFrame::UpdateRegDisplay`.
+
+JIT32 only generates one epilog (and causes all returns to branch to it) when there are PInvokes/InlinedCallFrame in the current method.
+
+## Per-frame PInvoke initialization
+
+The InlinedCallFrame is initialized once at the head of IL stubs and once in each path that does an inlined PInvoke.
+
+In JIT64 this initialization is placed in blocks that actually contain calls, but is hoisted out of loops that have landing pads by looking for dominator blocks. For IL stubs and methods with EH, we give up and place the initialization in the first block.
+
+In RyuJIT/JIT32 (ARM), all methods are treated like JIT64's IL stubs (meaning the per-frame initialization happens once just after the prolog).
+
+The JIT generates a call to `CORINFO_HELP_INIT_PINVOKE_FRAME` passing the address of the InlinedCallFrame and either NULL or the secret parameter for IL stubs. `JIT_InitPInvokeFrame` initializes the InlinedCallFrame and sets it to point to the current Frame chain top. Then it returns the current thread's native Thread object.
+
+On AMD64, the JIT generates code to save RSP and RBP into the InlinedCallFrame.
+
+For IL stubs only, the per-frame initialization includes setting `Thread->m_pFrame` to the InlinedCallFrame (effectively 'pushing' the Frame).
+
+## Per-call-site PInvoke work
+
+1. For direct calls, the JITed code sets `InlinedCallFrame->m_pDatum` to the MethodDesc of the call target.
+ * For JIT64, indirect calls within IL stubs set it to the secret parameter (this seems redundant, but it might have changed since the per-frame initialization?).
+ * For JIT32 (ARM) indirect calls, it sets this member to the size of the pushed arguments, according to the comments. The implementation, however, always passes 0.
+2. For JIT64/AMD64 only: Next for non-IL stubs, the InlinedCallFrame is 'pushed' by setting `Thread->m_pFrame` to point to the InlinedCallFrame (recall that the per-frame initialization already set `InlinedCallFrame->m_pNext` to point to the previous top). For IL stubs this step is accomplished in the per-frame initialization.
+3. The Frame is made active by setting `InlinedCallFrame->m_pCallerReturnAddress`.
+4. The code then toggles the GC mode by setting `Thread->m_fPreemptiveGCDisabled = 0`.
+5. Starting now, no GC pointers may be live in registers.
+6. Then comes the actual call/PInvoke.
+7. The GC mode is set back by setting `Thread->m_fPreemptiveGCDisabled = 1`.
+8. Then we check to see if `g_TrapReturningThreads` is set (non-zero). If it is, we call `CORINFO_HELP_STOP_FOR_GC`.
+ * For ARM, this helper call preserves the return register(s): `R0`, `R1`, `S0`, and `D0`.
+ * For AMD64, the generated code must manually preserve the return value of the PInvoke by moving it to a non-volatile register or a stack location.
+9. Starting now, GC pointers may once again be live in registers.
+10. Clear the `InlinedCallFrame->m_pCallerReturnAddress` back to 0.
+11. For JIT64/AMD64 only: For non-IL stubs 'pop' the Frame chain by resetting `Thread->m_pFrame` back to `InlinedCallFrame.m_pNext`.
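+
+Putting the steps above together, here is a hedged pseudo-code sketch of a direct inlined PInvoke call site for a JIT64/AMD64 non-IL-stub method (field, helper, and variable names follow this document; the real generated code differs in detail):
+
+```
+frame.m_pDatum = targetMethodDesc;           // step 1: record the call target
+thread.m_pFrame = &frame;                    // step 2: 'push' the InlinedCallFrame
+frame.m_pCallerReturnAddress = returnSite;   // step 3: the Frame is now active
+thread.m_fPreemptiveGCDisabled = 0;          // step 4: no GC refs in registers from here
+result = NativeTarget(args);                 // step 6: the actual PInvoke call
+thread.m_fPreemptiveGCDisabled = 1;          // step 7: back to cooperative mode
+if (g_TrapReturningThreads != 0)
+    CORINFO_HELP_STOP_FOR_GC();              // step 8: return value must be preserved
+frame.m_pCallerReturnAddress = 0;            // step 10: deactivate the Frame
+thread.m_pFrame = frame.m_pNext;             // step 11: 'pop' the Frame chain
+```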
+
+Saving/restoring all the non-volatile registers helps by preventing any registers that are unused in the current frame from accidentally having a live GC pointer value from a parent frame. The argument and return registers are 'safe' because they cannot be GC refs. Any refs should have been pinned elsewhere and instead passed as native pointers.
+
+For IL stubs, the Frame chain isn't popped at the call site, so instead it must be popped right before the epilog and right before any jmp calls. It looks like we do not support tail calls from PInvoke IL stubs?
+
+# Exception handling
+
+This section describes the conventions the JIT needs to follow when generating code to implement managed exception handling (EH). The JIT and VM must agree on these conventions for a correct implementation.
+
+## Funclets
+
+For non-x86 platforms, all managed EH handlers (finally, fault, filter, filter-handler, and catch) are extracted into their own 'funclets'. To the OS they are treated just like first class functions (separate PDATA and XDATA (`RUNTIME_FUNCTION` entry), etc.). The CLR currently treats them just like part of the parent function in many ways. The main function and all funclets must be allocated in a single code allocation (see hot cold splitting). They 'share' GC info. Only the main function prolog can be hot patched.
+
+The only way to enter a handler funclet is via a call. In the case of an exception, the call is from the VM's EH subsystem as part of exception dispatch/unwind. In the non-exceptional case, this is called local unwind or a non-local exit. In C# this is accomplished by simply falling-through/out of a try body or an explicit goto. In IL this is always accomplished via a LEAVE opcode, within a try body, targeting an IL offset outside the try body. In such cases the call is from the JITed code of the parent function.
+
+For x86, all handlers are generated within the method body, typically in lexical order. A nested try/catch is generated completely within the EH region in which it is nested. These handlers are essentially "in-line funclets", but they do not look like normal functions: they do not have a normal prolog or epilog, although they do have special entry/exit and register conventions. Also, nested handlers are not un-nested as for funclets: the code for a nested handler is generated within the handler in which it is nested.
+
+## Cloned finallys
+
+JIT64 attempts to speed up the normal control flow by 'inlining' a called finally along the 'normal' control flow (i.e., leaving a try body in a non-exceptional manner via C# fall-through). Because the VM semantics for non-rude Thread.Abort dictate that handlers will not be aborted, the JIT must mark these 'inlined' finally bodies. These show up as special entries at the end of the EH tables and are marked with `COR_ILEXCEPTION_CLAUSE_FINALLY | COR_ILEXCEPTION_CLAUSE_DUPLICATED`, and the try_start, try_end, and handler_start are all the same: the start of the cloned finally.
+
+JIT32 and RyuJIT currently do not implement finally cloning.
+
+## Invoking Finallys/Non-local exits
+
+In order to have proper forward progress and `Thread.Abort` semantics, there are restrictions on where a call-to-finally can be, and what the call site must look like. The return address can **NOT** be in the corresponding try body (otherwise the VM would think the finally protects itself). The return address **MUST** be within any outer protected region (so exceptions from the finally body are properly handled).
+
+JIT64, and RyuJIT for AMD64 and ARM64, creates something similar to a jump island: a block of code outside the try body that calls the finally and then branches to the final target of the leave/non-local-exit. This jump island is then marked in the EH tables as if it were a cloned finally. The cloned finally clause prevents a Thread.Abort from firing before entering the handler. By having the return address outside of the try body we satisfy the other constraint.
+
+Note that ARM solves this by not using a call (bl) instruction, instead explicitly placing a return address in `LR` and then jumping to the finally. We have not yet implemented this for AMD64 because it might mess up the call-return predictor on the CPU. (So far, performance data on ARM indicates this is not an issue there.)
+
+## ThreadAbortException considerations
+
+There are three kinds of thread abort: (1) rude thread abort, which cannot be stopped and doesn't run (all?) handlers, (2) calls to the `Thread.Abort()` API, and (3) asynchronous thread abort, injected from another thread.
+
+Note that ThreadAbortException is fully available in the desktop framework, and is heavily used in ASP.NET, for example. However, it is not supported in .NET Core, CoreCLR, or the Windows 8 "modern app profile". Nonetheless, the JIT generates ThreadAbort-compatible code on all platforms.
+
+For non-rude thread abort, the VM walks the stack, running any catch handler that catches ThreadAbortException (or a parent, like System.Exception, or System.Object), and running finallys. There is one very particular characteristic of ThreadAbortException: if a catch handler has caught ThreadAbortException, and the handler returns from handling the exception without calling Thread.ResetAbort(), then the VM *automatically re-raises ThreadAbortException*. To do so, it uses the resume address that the catch handler returned as the effective address where the re-raise is considered to have been raised. This is the address of the label that is specified by a LEAVE opcode within the catch handler. There are cases where the JIT must insert synthetic "step blocks" such that this label is within an appropriate enclosing "try" region, to ensure that the re-raise can be caught by an enclosing catch handler.
+
+For example:
+
+```
+try { // try 1
+ try { // try 2
+ System.Threading.Thread.CurrentThread.Abort();
+ } catch (System.Threading.ThreadAbortException) { // catch 2
+ ...
+ LEAVE L;
+ }
+} catch (System.Exception) { // catch 1
+ ...
+}
+L:
+```
+
+In this case, if the address returned in catch 2 corresponding to label L is outside try 1, then the ThreadAbortException re-raised by the VM will not be caught by catch 1, as it is expected to be. The JIT needs to insert a block such that this is the effective code generation:
+
+```
+try { // try 1
+ try { // try 2
+ System.Threading.Thread.CurrentThread.Abort();
+ } catch (System.Threading.ThreadAbortException) { // catch 2
+ ...
+ LEAVE L';
+ }
+ L': LEAVE L;
+} catch (System.Exception) { // catch 1
+ ...
+}
+L:
+```
+
+Similarly, the automatic re-raise address for a ThreadAbortException can't be within a finally handler, or the VM will abort the re-raise and swallow the exception. This can happen due to call-to-finally thunks marked as "cloned finally", as described above. For example (this is pseudo-assembly-code, not C#):
+
+```
+try { // try 1
+ try { // try 2
+ System.Threading.Thread.CurrentThread.Abort();
+ } catch (System.Threading.ThreadAbortException) { // catch 2
+ ...
+ LEAVE L;
+ }
+} finally { // finally 1
+ ...
+}
+L:
+```
+
+This would generate something like:
+
+```
+ // beginning of 'try 1'
+ // beginning of 'try 2'
+ System.Threading.Thread.CurrentThread.Abort();
+ // end of 'try 2'
+ // beginning of call-to-finally 'cloned finally' region
+L1: call finally1
+ nop
+ // end of call-to-finally 'cloned finally' region
+ // end of 'try 1'
+ // function epilog
+ ret
+
+Catch2:
+ // do something
+ lea rax, &L1; // load up resume address
+ ret
+
+Finally1:
+ // do something
+ ret
+```
+
+Note that the JIT must already insert a "step" block so the finally will be called. However, this isn't sufficient to support ThreadAbortException processing, because "L1" is marked as "cloned finally". In this case, the JIT must insert another step block that is within "try 1" but outside the cloned finally block, that will allow for correct re-raise semantics. For example:
+
+```
+ // beginning of 'try 1'
+ // beginning of 'try 2'
+ System.Threading.Thread.CurrentThread.Abort();
+ // end of 'try 2'
+L1': nop
+ // beginning of call-to-finally 'cloned finally' region
+L1: call finally1
+ nop
+ // end of call-to-finally 'cloned finally' region
+ // end of 'try 1'
+ // function epilog
+ ret
+
+Catch2:
+ // do something
+ lea rax, &L1'; // load up resume address
+ ret
+
+Finally1:
+ // do something
+ ret
+```
+
+Note that JIT64 does not implement this properly. The C# compiler used to always insert all necessary "step" blocks. The Roslyn C# compiler at one point did not, but was then changed to once again insert them.
+
+## The PSPSym and funclet parameters
+
+The name *PSPSym* stands for Previous Stack Pointer Symbol. It is how a funclet accesses locals from the main function body. This is not used for x86: the frame pointer on x86 is always preserved when the handlers are invoked.
+
+First, two definitions.
+
+*Caller-SP* is the value of the stack pointer in a function's caller before the call instruction is executed. That is, when function A calls function B, Caller-SP for B is the value of the stack pointer immediately before the call instruction in A (calling B) was executed. Note that this definition holds for both AMD64, which pushes the return address when a call instruction is executed, and for ARM, which doesn't. For AMD64, Caller-SP is the address above the call return address.
+
+*Initial-SP* is the initial value of the stack pointer after the fixed-size portion of the frame has been allocated. That is, before any "alloca"-type allocations.
+
+The PSPSym is a pointer-sized local variable in the frame of the main function and of each funclet. The value stored in PSPSym is the value of Initial-SP for AMD64 or Caller-SP for other platforms, for the main function. The stack offset of the PSPSym is reported to the VM in the GC information header. The value reported in the GC information is the offset of the PSPSym from Initial-SP for AMD64 or Caller-SP for other platforms. (Note that both the value stored, and the way the value is reported to the VM, differs between architectures. In particular, note that most things in the GC information header are reported as offsets relative to Caller-SP, but PSPSym on AMD64 is one exception, and maybe the only exception.)
+
+The VM uses the PSPSym to find other locals it cares about (such as the generics context in a funclet frame). The JIT uses it to re-establish the frame pointer register, so that the frame pointer is the same value in a funclet as it is in the main function body.
+
+When a funclet is called, it is passed the *Establisher Frame Pointer*. For AMD64 this is true for all funclets and it is passed as the first argument in RCX, but for ARM and ARM64 this is only true for first pass funclets (currently just filters) and it is passed as the second argument in R1. The Establisher Frame Pointer is a stack pointer of an interesting "parent" frame in the exception processing system. For the CLR, it points either to the main function frame or a dynamically enclosing funclet frame from the same function, for the funclet being invoked. The value of the Establisher Frame Pointer is Initial-SP on AMD64, Caller-SP on ARM and ARM64.
+
+Using the establisher frame, the funclet wants to load the value of the PSPSym. Since we don't know if the Establisher Frame is from the main function or a funclet, we design the main function and funclet frame layouts to place the PSPSym at an identical, small, constant offset from the Establisher Frame in each case. (This is also required because we only report a single offset to the PSPSym in the GC information, and that offset must be valid for the main function and all of its funclets). Then, the funclet uses this known offset to compute the PSPSym address and read its value. From this, it can compute the value of the frame pointer (which is a constant offset from the PSPSym value) and set the frame register to be the same as the parent function. Also, the funclet writes the value of the PSPSym to its own frame's PSPSym. This "copying" of the PSPSym happens for every funclet invocation, in particular, for every nested funclet invocation.
+
+On ARM and ARM64, for all second pass funclets (finally, fault, catch, and filter-handler) the VM restores all non-volatile registers to their values within the parent frame. This includes the frame register (`R11`). Thus, the PSPSym is not used to recompute the frame pointer register in this case, though the PSPSym is copied to the funclet's frame, as for all funclets.
+
+Catch, Filter, and Filter-handlers also get an Exception object (GC ref) as an argument (`REG_EXCEPTION_OBJECT`). On AMD64 it is the second argument and thus passed in RDX. On ARM and ARM64 this is the first argument and passed in R0.
+
+(Note that the JIT64 source code contains a comment that says, "The current CLR doesn't always pass the correct establisher frame to the funclet. Funclet may receive establisher frame of funclet when expecting that of original routine." It indicates this is the reason that a PSPSym is required in all funclets as well as the main function, whereas if the establisher frame was correctly reported, the PSPSym could be omitted in some cases.)
+
+## Funclet Return Values
+
+The filter funclet returns a simple boolean value in the normal return register (x86: `EAX`, AMD64: `RAX`, ARM/ARM64: `R0`). Non-zero indicates to the VM/EH subsystem that the corresponding filter-handler will handle the exception (i.e. begin the second pass). Zero indicates to the VM/EH subsystem that the exception is **not** handled, and it should continue looking for another filter or catch.
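+
+In C#, a filter funclet corresponds to an exception filter (`when` clause). A hedged sketch (the method names are arbitrary):
+
+```
+try
+{
+    DoWork();
+}
+catch (Exception e) when (ShouldHandle(e))  // the 'when' expression becomes the filter funclet;
+{                                           // returning non-zero selects this filter-handler
+    Recover();
+}
+```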
+
+The catch and filter-handler funclets return a code address in the normal return register that indicates where the VM should resume execution after unwinding the stack and cleaning up from the exception. This address should be somewhere in the parent funclet (or main function if the catch or filter-handler is not nested within any other funclet). Because an IL 'leave' opcode can exit out of arbitrary nesting of funclets and try bodies, the JIT is often required to inject step blocks. These are intermediate branch target(s) that then branch to the next outermost target until the real target can be directly reached via the native ABI constraints. These step blocks can also invoke finallys (see *Invoking Finallys/Non-local exits*).
+
+Finally and fault funclets do not have a return value.
+
+## Register values and exception handling
+
+Exception handling imposes certain restrictions on the usage of registers in functions with exception handling.
+
+CoreCLR and "desktop" CLR behave the same way. Windows and non-Windows implementations of the CLR both follow these rules.
+
+Some definitions:
+
+*Non-volatile* (aka *callee-saved* or *preserved*) registers are those defined by the ABI that a function call preserves. Non-volatile registers include the frame pointer and the stack pointer, among others.
+
+*Volatile* (aka *caller-saved* or *trashed*) registers are those defined by the ABI that a function call does not preserve, and thus might have a different value when the function returns.
+
+### Registers on entry to a funclet
+
+When an exception occurs, the VM is invoked to do some processing. If the exception is within a "try" region, it eventually calls a corresponding handler (which also includes calling filters). The exception location within a function might be where a "throw" instruction executes, the point of a processor exception like null pointer dereference or divide by zero, or the point of a call where the callee threw an exception but did not catch it.
+
+On AMD64, all register values that existed at the exception point in the corresponding "try" region are trashed on entry to the funclet. That is, the only registers that have known values are those of the funclet parameters.
+
+On ARM and ARM64, all registers are restored to their values at the exception point.
+
+On x86: TBD.
+
+### Registers on return from a funclet
+
+When a funclet finishes execution, and the VM returns execution to the function (or an enclosing funclet, if there is EH clause nesting), the non-volatile registers are restored to the values they held at the exception point. Note that the volatile registers have been trashed.
+
+Any register value changes made in the funclet are lost. If a funclet wants to make a variable change known to the main function (or the funclet that contains the "try" region), that variable change needs to be made to the shared main function stack frame.
+
+## x86 EH considerations
+
+The x86 model is somewhat different from the non-x86 model. X86-specific concerns are mentioned here.
+
+### catch / filter-handler regions
+
+When leaving a `catch` or `filter-handler` region, the JIT calls the helper `CORINFO_JIT_ENDCATCH` (implemented in the VM by the `JIT_EndCatch` function) before transferring control to the target location. The code that calls `CORINFO_JIT_ENDCATCH` is within the catch region itself.
+
+### finally / fault regions
+
+"finally" clauses are invoked in the non-exceptional code by the generated JIT code, and in the exceptional case by the VM. "fault" clauses are only executed in exceptional cases by the VM.
+
+On entry to the finally or fault, the top of the stack is the address that should be jumped to on exit from the finally, using a "pop eax; jmp eax" sequence. A simple 'ret' could be used, but we avoid it to prevent creating an unbalanced processor call/ret buffer stack and messing up call/ret prediction.
+
+There are no register or other stack arguments to a 'finally' or 'fault'.
+
+### ShadowSP slots
+
+X86 exception handlers (e.g., catch, finally) do not establish their own frames. They don't (really) have prologs and epilogs. However, they do use the stack, and need to restore the stack pointer of the enclosing exception handling region when the handler completes executing.
+
+To implement this requirement, for any function with EH, we create a frame-local variable to store a stack of "Shadow SP" values, or ShadowSP slots. In the JIT, the local var is called `lvaShadowSPslotsVar`, and in dumps it is called "EHSlots". The variable is created in `lvaMarkLocalVars()` and is sized as follows:
+1. 1 slot is reserved for the VM (for ICodeManager::FixContext(ppEndRegion)).
+2. 1 slot for each handler nesting level (total: ehMaxHndNestingCount).
+3. 1 slot for a filter (we do this even if there aren't any filters; size optimization opportunity to not do this if there are no filters?)
+4. 1 slot for zero termination
+
+Note that since a slot on x86 is 4 bytes, the minimum size is 16 bytes. The idea is to have 1 slot for each handler that could possibly be invoked at the same time. For example, for:
+
+```
+ try {
+ ...
+ } catch {
+ try {
+ ...
+ } catch {
+ ...
+ }
+ }
+```
+
+When the inner 'catch' is running, the outer 'catch' is also conceptually "on the stack", or in the middle of execution. So the maximum handler nesting count would be 2.
+
+The ShadowSP slots are filled in from the highest address downwards to the lowest address. The highest slot is reserved. The first address with a zero is a zero terminator. So, we always zero terminate by setting the second-to-highest slot to zero in the function prolog (if we didn't zero initialize all locals anyway).
+
+When calling a finally, we set the appropriate level to 0xFC (aka "finally call") and zero terminate the next-lower address.
+
+Thus, calling a finally from JIT generated code looks like:
+
+```
+ mov dword ptr [L_02+0x4 ebp-10H], 0 // This must happen before the 0xFC is written
+ mov dword ptr [L_02+0x8 ebp-0CH], 252 // 0xFC
+ push G_M52300_IG07
+ jmp SHORT G_M52300_IG04
+```
+
+In this case, `G_M52300_IG07` is not the address after the 'jmp', so a simple 'call' wouldn't work.
+
+The code this finally returns to looks like this:
+
+```
+ mov dword ptr [L_02+0x8 ebp-0CH], 0
+ jmp SHORT G_M52300_IG05
+```
+
+In this case, it zeros out the ShadowSP slot that it previously set to 0xFC, then jumps to the address that is the actual target of the leave from the finally.
+
+The JIT does this "end finally restore" by creating a `GT_END_LFIN` tree node, with the appropriate stack level as an operand, that generates this code.
+
+In the case of an exceptional 'finally' invocation, the VM sets up the 'return address' to whatever address it wants the JIT to return to.
+
+For catch handlers, the VM is completely in control of filling and reading the ShadowSP slots; the JIT just makes sure there is enough space.
+
+### ShadowSP slots frame location
+
+The ShadowSP slots are required to live in a very particular location, reported via the GC info header. Note that the GC info header does not contain an actual pointer or offset to the ShadowSP slots variable. Instead, the VM calculates the location from other data that does exist in the GC info header, as a negative offset from the EBP frame pointer (which must be established in functions with EH), using the functions `GetFirstBaseSPslotPtr()` / `GetStartShadowSPSlotsOffset()`. The VM thus assumes the following frame layout:
+
+1. callee-saved registers <= EBP points to the top of this range
+2. GS cookie
+3. 1 slot if localloc is used (Saved localloc SP?)
+4. 1 slot for CORINFO_GENERICS_CTXT_FROM_PARAMTYPEARG -- assumed for any function with EH, to avoid adding a flag to the GC info about whether it exists or not.
+5. ShadowSP slots
+
+(Note that these don't have to be in this order for this calculation, but they possibly do need to be in this order for other calculations.) See also `GetEndShadowSPSlotsOffset()`.
+
+The VM walks the ShadowSP slots in the function `GetHandlerFrameInfo()`, and sets them in various functions such as `EECodeManager::FixContext()`.
+
+### JIT implementation: finally
+
+An aside on the JIT implementation for x86.
+
+The JIT creates `BBJ_CALLFINALLY`/`BBJ_ALWAYS` pairs for calling the 'finally' clause. The `BBJ_CALLFINALLY` block will have a series of `CORINFO_JIT_ENDCATCH` calls appended at the end, if we need to "leave" a series of nested catches before calling the finally handler (due to a single 'leave' opcode attempting to leave multiple levels of different types of handlers). Then, a `GT_END_LFIN` statement with the finally clause handler nesting level as an argument is added to the step block where the finally returns to. This is used to generate code to zero out the appropriate level of the ShadowSP slot array after the finally has been executed. The `BBJ_CALLFINALLY` block itself generates the code to insert the 0xFC value into the ShadowSP slot array. If the 'finally' is invoked by the VM, in exceptional cases, then the VM itself updates the ShadowSP slot array before invoking the 'finally'.
+
+At the end of a finally or filter, a `GT_RETFILT` is inserted. For a finally, this is a `TYP_VOID` which is just a placeholder. For a filter, it takes an argument which evaluates to the return value from the filter. On the legacy JIT, this tree triggers the generation of both the return value load (for filters) and the "funclet" exit sequence, which is either a "pop eax; jmp eax" for a finally, or a "ret" for a filter. When processing the `BBJ_EHFINALLYRET` or `BBJ_EHFILTERRET` block itself (at the end of code generation for the block), nothing is generated. In RyuJIT, the `GT_RETFILT` only loads up the return value (for filters) and does nothing for finally, and the block type processing after all the tree processing triggers the exit sequence to be generated. There is no real difference between these, except to centralize all "exit sequence" generation in the same place.
+
+# EH Info, GC Info, and Hot & Cold Splitting
+
+All GC info offsets and EH info offsets treat the function and funclets as if they were one big method body. Thus all offsets are relative to the start of the main method. Funclets are assumed to always be at the end of (after) all of the main function code. Thus if the main function has any cold code, all funclets must be cold. Or conversely, if there is any hot funclet code, all of the main method must be hot.
+
+## EH clause ordering
+
+EH clauses must be sorted inner-to-outer, first-to-last based on IL offset of the try start/try end pair. The only exceptions are cloned finallys, which always appear at the end.
+
+## How EH affects GC info/reporting
+
+Because a main function body will **always** be on the stack when one of its funclets is on the stack, the GC info must be careful not to double-report. JIT64 accomplished this by having all named locals appear in the parent method frame: anything shared between the function and funclets was homed to the stack, and only the parent function reported stack locals (funclets might report local registers). JIT32 and RyuJIT (for AMD64, ARM, and ARM64) take the opposite approach. The leaf-most funclet is responsible for reporting everything that might be live out of a funclet (in the case of a filter, this might resume back in the original method body). This is accomplished with the GC header flag `WantsReportOnlyLeaf` (JIT32 and RyuJIT set it, JIT64 doesn't) and the VM tracking whether it has already seen a funclet for a given frame. Once JIT64 is fully retired, we should be able to remove this flag from the GC info.
+
+There is one "corner case" in the VM implementation of the `WantsReportOnlyLeaf` model that has implications for the code the JIT is allowed to generate. Consider this function with nested exception handling:
+
+```
+public void runtest() {
+ try {
+ try {
+ throw new UserException3(ThreadId); // 1
+ }
+ catch (UserException3 e){
+ Console.WriteLine("Exception3 was caught");
+ throw new UserException4(ThreadId);
+ }
+ }
+ catch (UserException4 e) { // 2
+ Console.WriteLine("Exception4 was caught");
+ }
+}
+```
+
+When the inner "throw new UserException4" is executed, the exception handling first pass finds that the outer catch handler will handle the exception. The exception handling second pass unwinds stack frames back to the "runtest" frame, and then executes the catch handler. There is a period of time during which the original catch handler ("catch (UserException3 e)") is no longer on the stack, but before the new catch handler is executed. During this time, a GC might occur. In this case, the VM needs to make sure to report GC roots properly for the "runtest" function. The inner catch has been unwound, so we can't report that. We don't want to report at "// 1", which is still on the stack, because that effectively is "going backwards" in execution, and doesn't properly represent what object references are live. We need to report live object references at the next location where execution will occur. This is the "// 2" location. However, we can't report the first location of the catch funclet, as that will be non-interruptible. The VM instead looks forward for the first interruptible point in that handler, and reports live references that the JIT reports for that location. This will be the first location after the handler prolog. There are several implications of this implementation for the JIT. It requires that:
+
+1. Methods which have EH clauses are fully interruptible.
+2. All catch funclets have an interruptible point immediately after the prolog.
+3. The first interruptible point in the catch funclet reports the following live objects on the stack
+ * Only objects that are shared with the parent method, i.e., no additional stack objects that are live only in the catch funclet and not in the parent method.
+ * All shared objects that are referenced in the catch funclet and any subsequent control flow are reported live.
+
+## Filter GC semantics
+
+Filters are invoked in the 1st pass of EH processing and as such execution might resume back at the faulting address, or in the filter-handler, or someplace else. Because the VM must allow GCs to occur during and after a filter invocation, but before the EH subsystem knows where it will resume, we need to keep everything alive at both the faulting address **and** within the filter. This is accomplished by 3 means: (1) the VM's stackwalker and GCInfoDecoder report as live both the filter frame and its corresponding parent frame, (2) the JIT encodes all stack slots that are live within the filter as being pinned, and (3) the JIT reports as live (and possibly zero-initializes) anything live-out of the filter. Because of (1) it is likely that a stack variable that is live within the filter and the try body will be double reported. During the mark phase of the GC, double reporting is not a problem. The problem only arises if the object is relocated: if the same location is reported twice, the GC will try to relocate the address stored at that location twice. Thus we prevent the object from being relocated by pinning it, which leads us to why we must do (2). (3) is done so that after the filter returns, we can still safely incur a GC before executing the filter-handler or any outer handler within the same frame.
+
+## Duplicated Clauses
+
+Duplicated clauses are a special set of entries in the EH tables to assist the VM. Specifically, if handler 'A' is also protected by an outer EH clause 'B', then the JIT must emit a duplicated clause, a duplicate of 'B', that marks the whole handler 'A' (which is lexically disjoint from the range of code for the corresponding try body 'A') as being protected by the handler for 'B'.
+
+Duplicated clauses are not needed for x86.
+
+During exception dispatch the VM uses these duplicated clauses to know when to skip any frames between the handler and its parent function. After skipping to the parent function, due to a duplicated clause, the VM searches for a regular/non-duplicate clause in the parent function. The order of duplicated clauses is important. They should appear after all of the main function clauses. They should still follow the normal sorting rules (inner-to-outer, top-to-bottom), but because the try-start/try-end will all be the same for a given handler, they should maintain the same inner-to-outer ordering as the corresponding original clauses.
+
+Example:
+
+```
+A: try {
+B: ...
+C: try {
+D: ...
+E: try {
+F: ...
+G: }
+H: catch {
+I: ...
+J: }
+K: ...
+L: }
+M: finally {
+N: ...
+O: }
+P: ...
+Q: }
+R: catch {
+S: ...
+T: }
+```
+
+In MSIL this would generate 3 EH clauses:
+
+```
+.try E-G catch H-J
+.try C-L finally M-O
+.try A-Q catch R-T
+```
+
+The native code would be laid out as follows (the order of the handlers is irrelevant except that they are after the main method body) with their corresponding (fake) native offsets:
+
+```
+A: -> 1
+B: -> 2
+C: -> 3
+D: -> 4
+E: -> 5
+F: -> 6
+G: -> 7
+K: -> 8
+L: -> 9
+P: -> 10
+Q: -> 11
+H: -> 12
+I: -> 13
+J: -> 14
+M: -> 15
+N: -> 16
+O: -> 17
+R: -> 18
+S: -> 19
+T: -> 20
+```
+
+The native EH clauses would be listed as follows:
+
+```
+1. .try 5-7 catch 12-14 (top-most & inner-most first)
+2. .try 3-9 finally 15-17 (top-most & next inner-most)
+3. .try 1-11 catch 18-20 (top-most & outer-most)
+4. .try 12-14 finally 15-17 duplicated (inner-most because clause 2 is inside clause 3, top-most because handler H-J is first)
+5. .try 12-14 catch 18-20 duplicated
+6. .try 15-17 catch 18-20
+```
+
+If the handlers were in a different order, then clause 6 might appear before clauses 4 and 5, but never in between.
+
+## GC Interruptibility and EH
+
+The VM assumes that anytime a thread is stopped, it must be at a GC safe point, or the current frame is non-resumable (i.e. a throw that will never be caught in the same frame). Thus effectively all methods with EH must be fully interruptible (or at a minimum all try bodies). Currently the GC info appears to support mixing of partially interruptible and fully-interruptible regions within the same method, but no JIT uses this, so use at your own risk.
+
+The debugger always wants to stop at GC safe points, and thus debuggable code should be fully interruptible to maximize the places where the debugger can safely stop. If the JIT creates non-interruptible regions within fully interruptible code, the code should ensure that each sequence point begins on an interruptible instruction.
+
+AMD64/JIT64 only: The JIT will add an interruptible NOP if needed.
+
+## Security Object
+
+The security object is a GC pointer and must be reported as such, and kept alive for the duration of the method.
+
+## GS Cookie
+
+The GS Cookie is not a GC object, but still needs to be reported. It can only have one lifetime due to how it is encoded/reported in the GC info. Since the GS Cookie ceases being valid once we pop the stack, the epilog cannot be part of the live range. Since we only get one live range that means there cannot be any code (except funclets) after the epilog in methods with a GS cookie.
+
+## NOPs and other Padding
+
+### AMD64 padding info
+
+The unwind callbacks don't know if the current frame is a leaf or a return address. Consequently, the JIT must ensure that the return address of a call is in the same region as the call. Specifically, the JIT must add a NOP (or some other instruction) after any call that otherwise would directly precede the start of a try body, the end of a try body, or the end of a method.
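+
+A hedged sketch, in the style of the pseudo-assembly examples above:
+
+```
+ // beginning of 'try'
+ call SomeHelper
+ nop              // padding: keeps the call's return address inside the 'try' region
+ // end of 'try'
+```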
+
+The OS has an optimization in the unwinder such that if an unwind results in a PC being within (or at the start of) an epilog, it assumes that frame is unimportant and unwinds again. Since the CLR considers every frame important, it does not want this double-unwind behavior and requires the JIT to place a NOP (or other instruction) between any call and any epilog.
+
+### ARM and ARM64 padding info
+
+The OS unwinder uses the `RUNTIME_FUNCTION` extents to determine which function or funclet to unwind out of. The net result is that a call (bl opcode) to `IL_Throw` cannot be the last instruction. So similar to AMD64, the JIT must inject an opcode (a breakpoint in this case) when the `bl IL_Throw` would otherwise be the last opcode of a function or funclet, the last opcode before the end of the hot section, or (this might be an x86-ism leaking into ARM) the last opcode before a "special throw block".
+
+The CLR unwinder assumes any non-leaf frame was unwound as a result of a call. This is mostly (always?) true except for non-exceptional finally invocations. For those cases, the JIT must place a 2 byte NOP **before** the address set as the finally return address (in the LR register, before jumping to the finally). I believe this is only needed if the preceding 2 bytes would have otherwise been in a different region (i.e. the end or start of a try body, etc.), but currently the JIT always emits the NOP. This is because the stack walker looks at the return address, subtracts 2, and uses that as the PC for the next step of stack walking. Note that the inserted NOP must have correct GC information.
+
+# Profiler Hooks
+
+If the JIT gets passed `CORJIT_FLG_PROF_ENTERLEAVE`, then the JIT might need to insert native entry/exit/tail call probes. To determine for sure, the JIT must call GetProfilingHandle. This API returns, as out parameters, the true dynamic boolean indicating if the JIT should actually insert the probes, and a parameter to pass to the callbacks (typed as void*), with an optional indirection (used for NGEN). This parameter is always the first argument to all of the call-outs (thus placed in the usual first argument register: `RCX` (AMD64) or `R0` (ARM, ARM64)).
+
+Outside of the prolog (in a GC interruptible location), the JIT injects a call to `CORINFO_HELP_PROF_FCN_ENTER`. For AMD64, all argument registers will be homed into their caller-allocated stack locations (similar to varargs). For ARM and ARM64, all arguments are prespilled (again similar to varargs).
+
+After computing the return value and storing it in the correct register, but before any epilog code (including before a possible GS cookie check), the JIT injects a call to `CORINFO_HELP_PROF_FCN_LEAVE`. For AMD64, this call must preserve the return register: `RAX` or `XMM0`. For ARM, the return value will be moved from `R0` to `R2` (if it was in `R0`); `R1`, `R2`, and `S0`/`D0` must be preserved by the callee (longs will be in `R2`, `R1` - note the unusual ordering of the registers; floats in `S0`; doubles in `D0`; smaller integrals in `R2`).
+
+TODO: describe ARM64 profile leave conventions.
+
+Before the argument setup (but after any argument side-effects) for any tail calls or jump calls, the JIT injects a call to `CORINFO_HELP_PROF_FCN_TAILCALL`. Note that it is NOT called for self-recursive tail calls turned into loops.
+
+For ARM tail calls, the JIT actually loads the outgoing arguments first, and then just before the profiler call-out, spills the argument in `R0` to another non-volatile register, makes the call (passing the callback parameter in `R0`), and then restores `R0`.
+
+For AMD64, all probes receive a second parameter (passed in `RDX` according to the default argument rules) which is the address of the start of the arguments' home location (equivalent to the value of the caller's stack pointer).
+
+TODO: describe ARM64 tail call convention.
+
+JIT32 only generates one epilog (and causes all returns to branch to it) when there are profiler hooks.
+
+# Synchronized Methods
+
+JIT32/RyuJIT only generates one epilog (and causes all returns to branch to it) when a method is synchronized. See `Compiler::fgAddSyncMethodEnterExit()`. The user code is wrapped in a try/finally. Outside/before the try body, the code initializes a boolean to false. `CORINFO_HELP_MON_ENTER` or `CORINFO_HELP_MON_ENTER_STATIC` is called, passing the lock object (the "this" pointer for instance methods or the Type object for static methods) and the address of the boolean. If the lock is acquired, the boolean is set to true (as an 'atomic' operation in the sense that a Thread.Abort/EH/GC/etc. cannot interrupt the thread when the boolean does not match the acquired state of the lock). The same logic and arguments apply to the call to `CORINFO_HELP_MON_EXIT` / `CORINFO_HELP_MON_EXIT_STATIC` placed in the finally. A sketch of the expansion appears below.
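+
+The following comment-only sketch illustrates the shape of that expansion; it is not literal JIT output, and the helper invocations stand in for the actual generated calls:
+
+    // Sketch of the expansion for a synchronized instance method:
+    //
+    //   bool lockTaken = false;                        // initialized before the try body
+    //   try {
+    //       CORINFO_HELP_MON_ENTER(this, &lockTaken);  // sets lockTaken when the lock is acquired
+    //       ... original method body ...
+    //   } finally {
+    //       CORINFO_HELP_MON_EXIT(this, &lockTaken);   // releases the lock only if lockTaken is true
+    //   }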
+
+# Rejit
+
+For AMD64 to support profiler attach scenarios, the JIT can be required to ensure every generated method is hot patchable (see `CORJIT_FLG_PROF_REJIT_NOPS`). The way we do this is to ensure that the first 5 bytes of code are non-interruptible and that there is no branch target within those bytes (this includes calls/returns). Thus the VM can stop all threads (like for a GC) and safely replace those 5 bytes with a branch to a new version of the method (presumably instrumented by a profiler). The JIT adds NOPs or increases the size of the prolog reported in the GC info to accomplish these two requirements.
+
+In a function with exception handling, only the main function is affected; the funclet prologs are not made hot patchable.
+
+# Edit and Continue
+
+Edit and Continue (EnC) is a special flavor of un-optimized code. The debugger has to be able to reliably remap a method state (instruction pointer and local variables) from original method code to edited method code. This puts constraints on the method stack layout performed by the JIT. The key constraint is that the addresses of the existing locals must stay the same after the edit. This constraint is required because the address of the local could have been stored in the method state.
+
+In the current design, the JIT does not have access to the previous versions of the method and so it has to assume the worst case. EnC is designed for simplicity, not for performance of the generated code.
+
+EnC is currently enabled on x86 and x64 only, but the same principles would apply if it is ever enabled on other platforms.
+
+The following sections describe the various Edit and Continue code conventions that must be followed.
+
+## EnC flag in GCInfo
+
+The JIT records the fact that it has followed conventions for EnC code in GC Info. On x64, this flag is implied by recording the size of the stack frame region preserved between EnC edits (`GcInfoEncoder::SetSizeOfEditAndContinuePreservedArea`). For normal methods on JIT64, the size of this region is 2 slots (saved `RBP` and return address). On RyuJIT/AMD64, the size of this region is increased to include `RSI` and `RDI`, so that `rep stos` can be used for block initialization and block moves.
+
+## Allocating local variables backward
+
+This is required to preserve addresses of the existing locals when an EnC edit appends new ones. In other words, the first local must be allocated at the highest stack address. Special care has to be taken to deal with alignment. The total size of the method frame can either grow (more locals added) or shrink (fewer temps needed) after the edit. The VM zeros out newly added locals.
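+
+For illustration, a rough comment-only sketch of the idea (not a literal x64 frame layout; `v1`/`v2`/`v3` are hypothetical locals):
+
+    //  higher addresses
+    //  | return address |  \
+    //  | saved RBP      |  /  preserved EnC area
+    //  | v1             |  <- first local, highest stack address
+    //  | v2             |
+    //  | v3             |  <- appended by the edit, below the existing locals
+    //  | temps          |  <- this region may grow or shrink across edits
+    //  lower addresses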
+
+## Fixed set of callee-saved registers
+
+This eliminates the need to deal with the different sets in the VM, and makes preservation of local addresses easier. On x64, we choose to always save `RBP` only. There are plenty of volatile registers, and so the lack of non-volatile registers does not impact the quality of non-optimized code.
+
+## EnC is supported for methods with EH
+
+However, EnC remap is not supported inside funclets. The stack layout of funclets does not matter for EnC.
+
+## Initial RSP == RBP == PSPSym
+
+This invariant allows the VM to compute the new value of `RBP` and the PSPSym after the edit without any additional information. The location of the PSPSym is found via GC info.
+
+## Localloc
+
+Localloc is allowed in EnC code, but remap is disallowed after the method has executed a localloc instruction. The VM uses the invariant above (`RSP == RBP`) to detect whether localloc was executed by the method.
+
+## Security object
+
+This does not require any special handling by the JIT on x64 (different from x86). The security object is copied over by the VM during remap if necessary. The location of the security object is found via GC info.
+
+## Synchronized methods
+
+The extra state created by the JIT for synchronized methods (original "this" and lock taken flag) must be preserved during remap. The JIT stores this state in the preserved region, and increases the size of the preserved region reported in GC info accordingly.
+
+## Generics
+
+EnC is not supported for generic methods and methods on generic types.
+
+# System V x86_64 support
+
+This section relates mostly to calling conventions on System V systems (such as Ubuntu Linux and Mac OS X).
+The general rules outlined in the System V x86_64 ABI (described at http://www.x86-64.org/documentation/abi.pdf) are followed with a few exceptions, described below:
+
+1. The hidden argument for by-value passed structs is always after the "this" parameter (if there is one). This is a difference from the System V ABI and affects only the internal JIT calling conventions. For PInvoke calls the hidden argument is always the first parameter since there is no "this" parameter in this case.
+2. Managed structs that have no fields are always passed by-value on the stack.
+3. The JIT proactively generates frame register frames (with `RBP` as a frame register) in order to aid the native OS tooling for stack unwinding and the like.
+4. All the other internal VM contracts for PInvoke, EH, and generic support remain in place. Please see the relevant sections above for more details. Note, however, that the registers used are different on System V due to the different calling convention. For example, the integer argument registers are, in order, RDI, RSI, RDX, RCX, R8, and R9. Thus, where the first argument (typically, the "this" pointer) on Windows AMD64 goes in RCX, on System V it goes in RDI, and so forth.
+5. Structs with explicit layout are always passed by value on the stack.
diff --git a/Documentation/botr/dac-notes.md b/Documentation/botr/dac-notes.md
new file mode 100644
index 0000000000..adeb9a8cee
--- /dev/null
+++ b/Documentation/botr/dac-notes.md
@@ -0,0 +1,213 @@
+Data Access Component (DAC) Notes
+=================================
+
+Date: 2007
+
+Debugging managed code requires special knowledge of managed objects and constructs. For example, objects have various kinds of header information in addition to the data itself. Objects may move in memory as the garbage collector does its work. Getting type information may require help from the loader. Retrieving the correct version of a function that has undergone an edit-and-continue or getting information for a function emitted through reflection requires the debugger to be aware of EnC version numbers and metadata. The debugger must be able to distinguish AppDomains and assemblies. The code in the VM directory embodies the necessary knowledge of these managed constructs. This essentially means that APIs to retrieve information about managed code and data must run some of the same algorithms that the execution engine itself runs.
+
+Debuggers can operate either _in-process_ or _out-of-process_. A debugger that runs in-process requires a live data target (the debuggee). In this case, the runtime has been loaded and the target is running. A helper thread in the debuggee runs code from the execution engine to compute the information the debugger needs. Because the helper thread runs in the target process, it has ready access to the target's address space and the runtime code. All the computation occurs in the target process. This is a simple way to get the information the debugger needs to be able to represent managed constructs in a meaningful way. Nevertheless, an in-process debugger has certain limitations. For example, if the debuggee is not currently running (as is the case when the debuggee is a dump file), the runtime is not loaded (and may not even be available on the machine). In this case, the debugger has no way to execute runtime code to get the information it needs.
+
+Historically, the CLR debugger has operated in-process. A debugger extension, SOS (Son of Strike), or Strike (in the early CLR days), can be used to inspect managed code. Starting with .NET Framework 4, the debugger runs out-of-process. The CLR debugger APIs provide much of the functionality of SOS along with other functionality that SOS does not provide. Both SOS and the CLR debugging APIs use the Data Access Component (DAC) to implement out-of-process debugging. The DAC is conceptually a subset of the runtime's execution engine code that runs out-of-process. This means that it can operate on a dump file, even on a machine that has no runtime installed. Its implementation consists mainly of a set of macros and templates, combined with conditional compilation of the execution engine's code. When the runtime is built, both clr.dll and mscordacwks.dll are produced. For CoreCLR builds, the binaries are slightly different: coreclr.dll and msdaccore.dll. The file names also differ when built for other operating systems, like OS X. To inspect the target, the DAC can read its memory to get the inputs for the VM code in mscordacwks. It can then run the appropriate functions in the host to compute the information needed about a managed construct and finally return the results to the debugger.
+
+Notice that the DAC reads _the memory of the target process_. It's important to realize that the debugger and the debuggee are separate processes with separate address spaces. Thus it is important to make a clear distinction between target memory and host memory. Using a target address in code running in the host process would have completely unpredictable and generally incorrect results. When using the DAC to retrieve memory from the target, it is important to be very careful to use addresses from the correct address space. Furthermore, target addresses are sometimes strictly used as data. In this case, it would be just as incorrect to use a host address. For example, to display information about a managed function, we might want to list its starting address and size. Here, it is important to provide the target address. When writing code in the VM that the DAC will run, one needs to correctly choose when to use host and target addresses.
+
+The DAC infrastructure (the macros and templates that control how host or target memory is accessed) supplies certain conventions that distinguish which pointers are host addresses and which are target addresses. When a function is _DACized_ (i.e., modified to use the DAC infrastructure so that it works out-of-process), host pointers of type _T_ are declared to be of type _T \*_. Target pointers are of type _PTR\_T_. Remember, though, that the concept of host versus target is only meaningful for the DAC. In a non-DAC build, we have only a single address space. The host and the target are the same: the CLR. If we declare a local variable of either type _T \*_ or of type _PTR\_T_ in a VM function, it will be a "host pointer." When we are executing code in clr.dll (coreclr.dll), there is absolutely no difference between a local variable of type _T \*_ and a local variable of type _PTR\_T_. If we execute the function compiled into mscordacwks.dll (msdaccore.dll) from the same source, the variable declared to be of type _T \*_ will be a true host pointer, with the debugger as the host. If you think about it, this is obvious. Nevertheless, it can become confusing when we start passing these pointers to other VM functions. When we are DACizing a function (i.e., changing _T \*_ to _PTR\_T_, as appropriate), we sometimes need to trace a pointer back to its point of origin to determine whether it should be a host or target type.
+
+When one has no understanding of the DAC, it's easy to find the use of the DAC infrastructure annoying. The TADDRs and PTR\_this and dac\_casts, etc. seem to clutter the code and make it harder to understand. With just a little work, though, you'll find that these are not really difficult to learn. Keeping host and target addresses explicitly different is really a form of strong typing. The more diligent we are, the easier it becomes to ensure our code is correct.
+
+Because the DAC potentially operates on a dump, the part of the VM sources we build into mscordacwks.dll (msdaccore.dll) must be non-invasive. Specifically, we usually don't want to do anything that would cause writing to the target's address space, nor can we execute any code that might cause an immediate garbage collection. (If we can defer the GC, it may be possible to allocate.) Note that the _host_ state is always mutated (temporaries, stack or local heap values); it is only mutating the _target_ space that is problematic. To enforce this, we do two things: code factoring and conditional compilation. In an ideal world, we would factor the VM code so that we would strictly isolate invasive actions in functions that are separate from non-invasive functions.
+
+Unfortunately, we have a large code base, most of which we wrote without ever thinking about the DAC at all. We have a significant number of functions with "find or create" semantics and many other functions that have some parts that just do inspection and other parts that write to the target. Sometimes we control this with a flag passed into the function. This is common in loader code, for example. To avoid having to complete the immense job of refactoring all the VM code before we can use the DAC, we have a second method to prevent executing invasive code from out of process. We have a defined pre-processor constant, DACCESS\_COMPILE, that we use to control what parts of the code we compile into the DAC. We would like to use the DACCESS\_COMPILE constant as little as we can, so when we DACize a new code path, we prefer to refactor whenever possible. Thus, a function that has "find or create" semantics should become two functions: one that tries to find the information and a wrapper that calls this and creates if the find fails. That way, the DAC code path can call the find function directly and avoid the creation, as sketched below.
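+
+A hedged sketch of that refactoring (all of the type and function names here are hypothetical, for illustration only):
+
+    // Non-invasive lookup: safe to compile into the DAC.
+    Widget * LookupWidget(Key key);
+
+    #ifndef DACCESS_COMPILE
+    // Invasive wrapper: compiled only into the runtime itself.
+    Widget * LookupOrCreateWidget(Key key)
+    {
+        Widget * pWidget = LookupWidget(key);
+        if (pWidget == NULL)
+        {
+            pWidget = CreateWidget(key);   // writes to the target address space
+        }
+        return pWidget;
+    }
+    #endif // DACCESS_COMPILE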
+
+How does the DAC work?
+======================
+
+As discussed, the DAC works by marshaling the data it needs and running code in the mscordacwks.dll (msdaccore.dll) module. It marshals data by reading from the target address space to get a target value, and then storing it in the host address space where the functions in mscordacwks can operate on it. This happens only on demand, so if the mscordacwks functions never need a target value, the DAC will not marshal it.
+
+Marshaling Principles
+---------------------
+
+The DAC maintains a cache of data that it reads. This avoids the overhead of reading the same values repeatedly. Of course, if the target is live, the values will potentially change. We can only assume the cached values are valid as long as the debuggee remains stopped. Once we allow the target to continue execution, we must flush the DAC cache. The DAC will retrieve the values again when the debugger stops the target for further inspection. The entries in the DAC cache are of type DAC\_INSTANCE. This contains (among other data) the target address, the size of the data and space for the marshaled data itself. When the DAC marshals data, it returns the address of the marshaled data part of this entry as the host address.
+
+When the DAC reads a value from the target, it marshals the value as a chunk of bytes of a given size (determined by its type). By keeping the target address as a field in the cache entries, it maintains a mapping between the target address and the host address (the address in the cache). Between any stop and continue of a debugger session, the DAC will marshal each value requested only once, as long as subsequent accesses use the same type. (If we reference the target address by two different types, the size may be different, so the DAC will create a new cache entry for the new type.) If the value is already in the cache, the DAC will be able to look it up by its target address. That means we can correctly compare two host pointers for (in)equality as long as we have accessed both pointers using the same type. This identity of pointers does not hold across type conversions, however. Furthermore, we have no guarantee that values marshaled separately will maintain the same spatial relationship in the cache that they do in the target, so it is incorrect to compare two host pointers for less-than or greater-than relationships. Object layout must be identical in host and target, so we can access fields in an object in the cache using the same offsets we use in the target. Remember that any pointer fields in a marshaled object will be target addresses (generally declared as data members of a PTR type). If we need the values at those addresses, the DAC must marshal them to the host before dereferencing them.
+
+Because we build this dll from the same sources that we use to build mscorwks.dll (coreclr.dll), the mscordacwks.dll (msdaccore.dll) build that the debugger uses must match the mscorwks build exactly. You can see that this is obviously true if you consider that between builds we might add or remove a field from a type we use. The size for the object in mscorwks would then be different from the size in mscordacwks and the DAC could not marshal the object correctly. This has a ramification that's obvious when you think about it, but easy to overlook. We cannot have fields in objects that exist only in DAC builds or only in non-DAC builds. Thus, a declaration such as the following would lead to incorrect behavior.
+
+ class Foo
+ {
+ ...
+ int nCount;
+
+ // DON'T DO THIS!! Object layout must match in DAC builds
+ #ifndef DACCESS_COMPILE
+
+ DWORD dwFlags;
+
+ #endif
+
+ PTR_Bar pBar;
+ ...
+ };
+
+Marshaling Specifics
+--------------------
+
+DAC marshaling works through a collection of typedefs, macros and templated types that generally have one meaning in DAC builds and a different meaning in non-DAC builds. You can find these declarations in [src\inc\daccess.h][daccess.h]. You will also find a long comment at the beginning of this file that explains the details necessary to write code that uses the DAC.
+
+[daccess.h]: https://github.com/dotnet/coreclr/blob/master/src/inc/daccess.h
+
+An example may be helpful in understanding how marshaling works. The common debugging scenario is represented in the following block diagram:
+
+![DAC Overview](../images/dac-overview.png)
+
+The debugger in this figure could be Visual Studio, MDbg, WinDbg, etc. The debugger interfaces with the CLR debugger interface (DBI) APIs to get the information it needs. Information that must come from the target goes through the DAC. The debugger implements the data target, which is responsible for implementing a ReadVirtual function to read memory in the target. The dotted line in the diagram represents the process boundary.
+
+Suppose the debugger needs to display the starting address of an ngen'ed method in the managed application that it has gotten from the managed stack. We will assume that the debugger has already gotten an instance of ICorDebugFunction back from the DBI. It will begin by calling the DBI API ICorDebugFunction::GetNativeCode. This calls into the DAC through the DAC/DBI interface function GetNativeCodeInfo, passing in the domain file and metadata token for the function. The following code fragment is a simplification of the actual function, but it illustrates marshaling without introducing extraneous details.
+
+    void DacDbiInterfaceImpl::GetNativeCodeInfo(TADDR taddrDomainFile,
+                                                mdToken functionToken,
+                                                NativeCodeFunctionData * pCodeInfo)
+    {
+        ...
+
+        DomainFile * pDomainFile = dac_cast<PTR_DomainFile>(taddrDomainFile);
+        Module * pModule = pDomainFile->GetCurrentModule();
+
+        MethodDesc * pMethodDesc = pModule->LookupMethodDef(functionToken);
+        pCodeInfo->pNativeCodeMethodDescToken = pMethodDesc;
+
+        // if we are loading a module and trying to bind a previously set breakpoint, we may not have
+        // a method desc yet, so check for that situation
+        if (pMethodDesc != NULL)
+        {
+            pCodeInfo->startAddress = pMethodDesc->GetNativeCode();
+            ...
+        }
+    }
+
+The first step is to get the module in which the managed function resides. The taddrDomainFile parameter we pass in represents a target address, but we will need to be able to dereference it here. This means we need the DAC to marshal the value. The dac\_cast operator will construct a new instance of PTR\_DomainFile with a target address equal to the value of taddrDomainFile. When we assign this to pDomainFile, we have an implicit conversion to the host pointer type. This conversion operator is a member of the PTR type, and this is where the marshaling occurs. The DAC first searches its cache for the target address. If it doesn't find it, it reads the data from the target for the marshaled DomainFile instance and copies it to the cache. Finally, it returns the host address of the marshaled value.
+
+Now we can call GetCurrentModule on this host instance of the DomainFile. This function is a simple accessor that returns DomainFile::m\_pModule. Notice that it returns a Module \*, which will be a host address. The value of m\_pModule is a target address (the DAC will have copied the DomainFile instance as raw bytes). The type for the field is PTR\_Module, however, so when the function returns it, the DAC will automatically marshal it as part of the conversion to Module \*. That means the return value is a host address. Now we have the correct module and a method token, so we have all the information we need to get the MethodDesc.
+
+ Module * DomainFile::GetCurrentModule()
+ {
+ LEAF_CONTRACT;
+ SUPPORTS_DAC;
+ return m_pModule;
+ }
+
+In this simplified version of the code, we are assuming that the method token is a method definition. The next step, then, is to call the LookupMethodDef function on the Module instance.
+
+    inline MethodDesc * Module::LookupMethodDef(mdMethodDef token)
+    {
+        WRAPPER_CONTRACT;
+        SUPPORTS_DAC;
+        ...
+        return dac_cast<PTR_MethodDesc>(GetFromRidMap(&m_MethodDefToDescMap,
+                                                      RidFromToken(token)));
+    }
+
+This uses the RidMap to look up the MethodDesc. If you look at the definition for this function, you will see that it returns a TADDR:
+
+    TADDR GetFromRidMap(LookupMap * pMap, DWORD rid)
+    {
+        ...
+
+        TADDR result = pMap->pTable[rid];
+        ...
+        return result;
+    }
+
+This represents a target address, but it's not really a pointer; it's simply a number (although it represents an address). The problem is that LookupMethodDef needs to return the address of a MethodDesc that we can dereference. To accomplish this, the function uses a dac\_cast to PTR\_MethodDesc to convert the TADDR to a PTR\_MethodDesc. You can think of this as the target address space form of a cast from void \* to MethodDesc \*. In fact, this code would be slightly cleaner if GetFromRidMap returned a PTR\_VOID (with pointer semantics) instead of a TADDR (with integer semantics). Again, the type conversion implicit in the return statement ensures that the DAC marshals the object (if necessary) and returns the host address of the MethodDesc in the DAC cache.
+
+The assignment statement in GetFromRidMap indexes an array to get a particular value. The pMap parameter is the address of a structure field from the MethodDesc. As such, the DAC will have copied the entire field into the cache when it marshaled the MethodDesc instance. Thus, pMap, which is the address of this struct, is a host pointer. Dereferencing it does not involve the DAC at all. The pTable field, however, is a PTR\_TADDR. What this tells us is that pTable is an array of target addresses, but its type indicates that it is a marshaled type. This means that the value of pTable will be a target address as well. We dereference it with the overloaded indexing operator for the PTR type. This will get the target address of the array and compute the target address of the element we want. The last step of indexing marshals the array element back to a host instance in the DAC cache and returns its value. We assign the element (a TADDR) to the local variable result and return it.
+
+Finally, to get the code address, the DAC/DBI interface function will call MethodDesc::GetNativeCode. This function returns a value of type PCODE. This type is a target address, but one that we cannot dereference (it is just an alias of TADDR) and one that we use specifically to specify a code address. We store this value on the ICorDebugFunction instance and return it to the debugger.
+
+### PTR Types
+
+Because the DAC marshals values from the target address space to the host address space, understanding how the DAC handles target pointers is fundamental. We collectively refer to the fundamental types used for marshaling these as "PTR types." You will see that [daccess.h][daccess.h] defines two classes: \_\_TPtrBase, which has several derived types, and \_\_GlobalPtr. We don't use these types directly; we use them only indirectly through a number of macros. Each of these contains a single data member to give us the target address of the value. For \_\_TPtrBase, this is a full address. For \_\_GlobalPtr, it is a relative address, referenced from a DAC global base location. The "T" in \_\_TPtrBase stands for "target". As you can guess, we use types derived from \_\_TPtrBase for pointers that are data members or locals and we use \_\_GlobalPtr for globals and statics.
+
+In practice, we use these types only through macros. The introductory comment in [daccess.h][daccess.h] has examples of the use of all of these. What is interesting about these macros is that they will expand to declare instantiated types from these marshaling templates in DAC builds, but are no-ops in non-DAC builds. For example, the following definition declares PTR\_MethodTable as a type to represent method table pointers (note that the convention is to name these types with a prefix of PTR\_):
+
+    typedef DPTR(class MethodTable) PTR_MethodTable;
+
+In a DAC build, the DPTR macro will expand to declare a \_\_DPtr<MethodTable> type named PTR\_MethodTable. In a non-DAC build, the macro simply declares PTR\_MethodTable to be MethodTable \*. This implies that the DAC functionality does not result in any behavior change or performance degradation in non-DAC builds.
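+
+A simplified sketch of that dual expansion (the real definitions in [daccess.h][daccess.h] have more detail):
+
+    #ifdef DACCESS_COMPILE
+    typedef __DPtr<MethodTable> PTR_MethodTable;   // marshaling smart-pointer type
+    #else
+    typedef MethodTable *       PTR_MethodTable;   // plain host pointer
+    #endif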
+
+Even better, in a DAC build, the DAC will automatically marshal variables, data members, or return values declared to be of type PTR\_MethodTable, as we saw in the example in the last section. The marshaling is completely transparent. The \_\_DPtr type has overloaded operator functions to redefine pointer dereferencing and array indexing, and a conversion operator to cast to the host pointer type. These operations determine whether the requested value is already in the cache, in which case they return it immediately, or whether it is necessary to read from the target and load the value into the cache before returning it. If you are interested in understanding the details, the function responsible for these cache operations is DacInstantiateTypeByAddressHelper.
+
+PTR types defined with DPTR are the most common in the runtime, but we also have PTR types for global and static pointers, restricted-use arrays, pointers to variable-sized objects, and pointers to classes with virtual functions that we may need to call from mscordacwks.dll (msdaccore.dll). Most of these are rare and you can refer to [daccess.h][daccess.h] to learn more about them if you need them.
+
+The GPTR and VPTR macros are common enough to warrant special mention here. Both the way we use these and their external behavior are quite similar to DPTRs. Again, marshaling is automatic and transparent. The VPTR macro declares a marshaled pointer type for a class with virtual functions. This special macro is necessary because the virtual function table is essentially an implicit extra field. The DAC has to marshal this separately, since the function addresses are all target addresses that the DAC must convert to host addresses. Treating these classes in this way means that the DAC automatically instantiates the correct implementation class, making casts between base and derived types unnecessary. When you declare a VPTR type, you must also list it in vptr\_list.h. \_\_GlobalPtr types provide base functionality to marshal both global variables and static data members through the GPTR, GVAL, SPTR and SVAL macros. The implementation of global variables is almost identical to that of static fields (both use the \_\_GlobalPtr class) and requires the addition of an entry in [dacvars.h][dacvars.h]. The comments in daccess.h and dacvars.h provide more details about declaring these types.
+
+[dacvars.h]: https://github.com/dotnet/coreclr/blob/master/src/inc/dacvars.h
+
+Global and static values and pointers are interesting because they form the entry points to the target address space (all other uses of the DAC require you to have a target address already). Many of the globals in the runtime are already DACized. It occasionally becomes necessary to make a previously unDACized (or a newly introduced) global available to the DAC. By using the appropriate macros and [dacvars.h][dacvars.h] entry, you enable a post-build step (DacTableGen.exe run by the build in ndp\clr\src\dacupdatedll) to save the address of the global (from clr.pdb) into a table that is embedded into mscordacwks.dll. The DAC uses this table at run-time to determine where to look in the target address space when the code accesses a global.
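+
+As a sketch of the declaration pattern (see the comments in [daccess.h][daccess.h] for the authoritative usage; the ThreadStore example below mirrors the pattern used in the runtime sources):
+
+    // In the header: declare a static, DAC-visible pointer member.
+    class ThreadStore
+    {
+        ...
+        SPTR_DECL(ThreadStore, s_pThreadStore);
+        ...
+    };
+
+    // In the .cpp file: define the storage so the DAC can find it.
+    SPTR_IMPL(ThreadStore, ThreadStore, s_pThreadStore);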
+
+### VAL Types
+
+In addition to pointer types, the DAC must also marshal static and global values (as opposed to values referenced by static or global pointers). For this we have a collection of macros of the form ?VAL\_\*: we use GVAL\_\* for global values, and SVAL\_\* for static values. The comment in the [daccess.h][daccess.h] file has a table showing how to use the various forms of these and includes instructions for declaring global and static values (and global and static pointers) that we will use in DACized code.
+
+### Pure Addresses
+
+The TADDR and PCODE types we introduced in the example of DAC operation are pure target addresses. These are actually integer types, rather than pointers. This prevents code in the host from incorrectly dereferencing them. The DAC does not treat them as pointers either. Specifically, because we have no type or size information, no dereferencing or marshaling can occur. We use these primarily in two situations: when we are treating a target address as pure data, and when we need to do pointer arithmetic with target addresses (although we can also do pointer arithmetic with PTR types). Of course, because TADDRs have no type information for the target locations they specify, when we perform address arithmetic, we need to factor in the size explicitly.
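+
+For example, a minimal sketch of explicit-size address arithmetic (`MyElement` is a hypothetical type used only for illustration):
+
+    // TADDRs are integers, not pointers, so the element size must be applied
+    // explicitly when computing the target address of an array element.
+    TADDR ComputeElementAddr(TADDR arrayBase, DWORD index)
+    {
+        return arrayBase + index * sizeof(MyElement);
+    }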
+
+We also have one special class of PTRs that don't involve marshaling: PTR\_VOID and PTR\_CVOID. These are the target equivalents of void \* and const void \*, respectively. Because TADDRs are simply numbers, they don't have pointer semantics, which means that if we DACize code by converting void \* to TADDR (as was often the case in the past), we often need extra casts and other changes, even in code that does not compile for the DAC. Using PTR\_VOID makes it easier and cleaner to DACize code that uses void \* by preserving the semantics expected for void \*. If we DACize a function that uses PTR\_VOID or PTR\_CVOID, we can't directly marshal data from these addresses, since we have no idea how much data we would need to read. This means we can't dereference them (or even do pointer arithmetic), but this is identical to the semantics of void \*. As is the case for void \*, we generally cast them to a more specific PTR type when we need to use them. We also have a PTR\_BYTE type, which is a standard marshaled target pointer (that supports pointer arithmetic, etc.). In general, when we DACize code, void \* becomes PTR\_VOID and BYTE \* becomes PTR\_BYTE, just as you would expect. [daccess.h][daccess.h] has explanatory comments that provide more details about the use and semantics of the PTR\_VOID type.
+
+Occasionally, legacy code stores a target address in a host pointer type such as void \*. This is always a bug and makes it extremely difficult to reason about the code. It will also break when we support cross-platform debugging, where the pointer types may be different sizes. In DAC builds, the void \* type is a host pointer which should never contain a target address. Using PTR\_VOID instead allows us to indicate that a void pointer type is a target address. We are trying to eliminate all such uses, but some are quite pervasive in the code and will take a while to eliminate entirely.
+
+### Conversions
+
+In earlier CLR versions, we used C-style type casting, macros, and constructors to cast between types. For example, in MethodIterator::Next, we have the following:
+
+ if (methodCold)
+ {
+ PTR_CORCOMPILE_METHOD_COLD_HEADER methodColdHeader
+ = PTR_CORCOMPILE_METHOD_COLD_HEADER((TADDR)methodCold);
+
+ if (((TADDR)methodCode) == PTR_TO_TADDR(methodColdHeader->hotHeader))
+ {
+ // Matched the cold code
+ m_pCMH = PTR_CORCOMPILE_METHOD_COLD_HEADER((TADDR)methodCold);
+ ...
+
+Both methodCold and methodCode are declared as BYTE \*, but in fact hold target addresses. In line 4, methodCold is cast to a TADDR and used as the argument to the constructor for PTR\_CORCOMPILE\_METHOD\_COLD\_HEADER. At this point, methodColdHeader is explicitly a target address. In line 6, there is another C-style cast for methodCode. The hotHeader field of methodColdHeader is of type PTR\_CORCOMPILE\_METHOD\_HEADER. The macro PTR\_TO\_TADDR extracts the raw target address from this PTR type so that it can be compared with methodCode. Finally, in line 9, another instance of type PTR\_CORCOMPILE\_METHOD\_COLD\_HEADER is constructed. Again, methodCold is cast to TADDR to pass to this constructor.
+
+If this code seems overly complex and confusing to you, that's good. In fact it is. Worse, it provides no protection for the separation of host and target addresses. From the declarations of methodCold and methodCode, there is no particular reason to interpret them as target addresses at all. If these pointers were dereferenced in DAC builds as if they really were host pointers, the process would probably AV. This snippet demonstrates that any arbitrary pointer type (as opposed to a PTR type) can be cast to a TADDR. Given that these two variables always hold target addresses, they should be of type PTR\_BYTE, rather than BYTE \*.
+
+There is also a disciplined means to cast between different PTR types: dac\_cast. The dac\_cast operator is the DAC-aware version of the C++ static\_cast operator (which the CLR coding conventions stipulate instead of C-style casts when casting pointer types). The dac\_cast operator will do any of the following things:
+
+1. Create a PTR type from a TADDR
+2. Convert one PTR type to another
+3. Create a PTR from a host instance previously marshaled to the DAC cache
+4. Extract the TADDR from a PTR type
+5. Get a TADDR from a host instance previously marshaled to the DAC cache
+
+Now, assuming both methodCold and methodCode are declared to be of type PTR\_BYTE, the code above can be rewritten as follows.
+
+ if (methodCold)
+ {
+ PTR_CORCOMPILE_METHOD_COLD_HEADER methodColdHeader
+ = dac_cast<PTR_CORCOMPILE_METHOD_COLD_HEADER>(methodCold);
+
+ if (methodCode == methodColdHeader->hotHeader)
+ {
+ // Matched the cold code
+ m_pCMH = methodColdHeader;
+
+You might argue that this code still seems complex and confusing, but at least we have significantly reduced the number of casts and constructors. We have also used constructs that maintain the separation between host and target pointers, so we have made the code safer. In particular, dac\_cast will often generate compiler or run-time errors if we try to do the wrong thing. In general, dac\_cast should be used for conversions.
+
+DACizing
+========
+
+When do you need to DACize?
+---------------------------
+
+Whenever you add a new feature, you will need to consider its debuggability needs and DACize the code to support your feature. You must also ensure that any other changes, such as bug fixes or code clean-up, conform to the DAC rules when necessary. Otherwise, the changes will break the debugger or SOS. If you are simply modifying existing code (as opposed to implementing a new feature), you will generally be able to determine that you need to worry about the DAC when a function you modify includes a SUPPORTS\_DAC contract. This contract has a few variants such as SUPPORTS\_DAC\_WRAPPER and LEAF\_DAC\_CONTRACT. You can find comments explaining the differences in [contract.h][contract.h]. If you see a number of DAC-specific types in the function, you should assume the code will run in DAC builds.
+
+[contract.h]: https://github.com/dotnet/coreclr/blob/master/src/inc/contract.h
+
+DACizing ensures that code in the engine will work correctly with the DAC. It is important to use the DAC correctly to marshal values from the target to the host. Target addresses used incorrectly from the host (or vice versa) may reference unmapped addresses. If addresses are mapped, the values will be completely unrelated to the values expected. As a result, DACizing mostly involves ensuring that we use PTR types for all values that the DAC needs to marshal. Another major task is to ensure that we do not allow invasive code to execute in DAC builds. In practice, this means that we must sometimes refactor code or add DACCESS\_COMPILE preprocessor directives. We also want to be sure that we add the appropriate SUPPORTS\_DAC contract. The use of this contract signals to developers that the function works with the DAC. This is important for two reasons:
+
+1. If we later call it from some other SUPPORTS\_DAC function, we know that it is DAC-safe and we don't need to worry about DACizing it.
+2. If we make modifications to the function, we need to make sure that they are DAC-safe. If we add a call to another function from this one, we also need to ensure that it is DAC-safe or that we only make the call in non-DAC builds.
diff --git a/Documentation/botr/exceptions.md b/Documentation/botr/exceptions.md
new file mode 100644
index 0000000000..daa684bf8b
--- /dev/null
+++ b/Documentation/botr/exceptions.md
@@ -0,0 +1,299 @@
+What Every Dev needs to Know About Exceptions in the Runtime
+============================================================
+
+Date: 2005
+
+When talking about "exceptions" in the CLR, there is an important distinction to keep in mind. There are managed exceptions, which are exposed to applications through mechanisms like C#'s try/catch/finally, with all of the runtime machinery to implement them. And then there is the use of exceptions inside the runtime itself. Most runtime developers seldom need to think about how to build and expose the managed exception model. But every runtime developer needs to understand how exceptions are used in the implementation of the runtime. When there is a need to keep the distinction clear, this document will refer to _managed exceptions_ that a managed application may throw or catch, and will refer to the _CLR's internal exceptions_ that are used by the runtime for its own error handling. Mostly, though, this document is about the CLR's internal exceptions.
+
+Where do exceptions matter?
+===========================
+
+Exceptions matter almost everywhere. They matter the most in functions that throw or catch exceptions, because that code must be written explicitly to throw the exception, or to catch and properly handle an exception. Even if a particular function doesn't itself throw an exception, it may well call one that does, and so that particular function must be written to behave correctly when an exception is thrown through it. The judicious use of _holders_ can greatly ease writing such code correctly.
+
+Why are CLR internal exceptions different?
+==========================================
+
+The CLR's internal exceptions are much like C++ exceptions, but not exactly. Rotor can be built for Mac OSX, for BSD, and for Windows. The OS and compiler differences dictate that we can't just use standard C++ try/catch. In addition, the CLR internal exceptions provide features similar to the managed "finally" and "fault".
+
+With the help of some macros, it is possible to write exception handling code that is almost as easy to write and to read as standard C++.
+
+Catching an Exception
+=====================
+
+EX_TRY
+------
+
+The basic macros are, of course, EX_TRY / EX_CATCH / EX_END_CATCH, and in use they look like this:
+
+ EX_TRY
+ // Call some function. Maybe it will throw an exception.
+ Bar();
+ EX_CATCH
+ // If we're here, something failed.
+ m_finalDisposition = terminallyHopeless;
+ EX_END_CATCH(RethrowTransientExceptions)
+
+The EX_TRY macro simply introduces the try block, and is much like the C++ "try", except that it also includes an opening brace, "{".
+
+EX_CATCH
+--------
+
+The EX_CATCH macro ends the try block, including the closing brace, "}", and begins the catch block. Like the EX_TRY, it also starts the catch block with an opening brace.
+
+And here is the big difference from C++ exceptions: the CLR developer doesn't get to specify what to catch. In fact, this set of macros catches everything, including non-C++ exceptions like AV or a managed exception. If a bit of code needs to catch just one exception, or a subset, then it will need to catch everything, examine the exception, and rethrow anything that isn't relevant.
+
+It bears repeating that the EX_CATCH macro catches everything. This behavior is frequently not what a function needs. The next two sections discuss more about how to deal with exceptions that shouldn't have been caught.
+
+GET_EXCEPTION() & GET_THROWABLE()
+---------------------------------
+
+How, then, does a CLR developer discover just what has been caught, and determine what to do? There are several options, depending on just what the requirement is.
+
+First, whatever the (C++) exception that is caught, it will be delivered as an instance of some class derived from the global Exception class. Some of these derived classes are pretty obvious, like OutOfMemoryException. Some are somewhat domain specific, like EETypeLoadException. And some of these are just wrapper classes around another system's exceptions, like CLRException (has an OBJECTHANDLE to reference any managed exception) or HRException (wraps an HRESULT). If the original exception was not derived from Exception, the macros will wrap it up in something that is. (Note that all of these exceptions are system-provided and well known. _New exception classes shouldn't be added without involving the Core Execution Engine Team!_)
+
+Next, there is always an HRESULT associated with a CLR internal exception. Sometimes, as with HRException, the value came from some COM source, but internal errors and Win32 API failures also have HRESULTs.
+
+Finally, because almost any exception inside the CLR could possibly be delivered back to managed code, there is a mapping from the internal exceptions back to the corresponding managed exceptions. The managed exception won't necessarily be created, but there is always the possibility of obtaining it.
+
+So, given these features, how does the CLR developer categorize an exception?
+
+Frequently, all that is needed to categorize an exception is the HRESULT that corresponds to it, and this is extremely easy to get:
+
+ HRESULT hr = GET_EXCEPTION()->GetHR();
+
+More information is often most conveniently available through the managed exception object. And if the exception will be delivered back to managed code, whether immediately, or cached for later, the managed object is, of course, required. And the exception object is just as easy to get. Of course, it is a managed objectref, so all the usual rules apply:
+
+ OBJECTREF throwable = NULL;
+ GCPROTECT_BEGIN(throwable);
+ // . . .
+ EX_TRY
+ // . . . do something that might throw
+ EX_CATCH
+ throwable = GET_THROWABLE();
+ EX_END_CATCH(RethrowTransientExceptions)
+ // . . . do something with throwable
+ GCPROTECT_END()
+
+Sometimes, there is no avoiding a need for the C++ exception object, though this is mostly inside the exception implementation. If it is important exactly what the C++ exception type is, there is a set of lightweight RTTI-like functions that help categorize exceptions. For instance,
+
+ Exception *pEx = GET_EXCEPTION();
+ if (pEx->IsType(CLRException::GetType())) {/* ... */}
+
+would tell whether the exception is (or derives from) CLRException.
+
+EX_END_CATCH(RethrowTransientExceptions)
+----------------------------------------
+
+In the example above, "RethrowTransientExceptions" is an argument to the EX_END_CATCH macro; it is one of three pre-defined macros that can be thought of as "exception dispositions". Here are the macros, and their meanings:
+
+- _SwallowAllExceptions_: This is aptly named, and very simple. As the name suggests, it swallows everything. While simple and appealing, this is often not the right thing to do.
+- _RethrowTerminalExceptions_: A better name would be "RethrowThreadAbort", which is what this macro does.
+- _RethrowTransientExceptions_: The best definition of a "transient" exception is one that might not occur if tried again, possibly in a different context. These are the transient exceptions:
+ - COR_E_THREADABORTED
+ - COR_E_THREADINTERRUPTED
+ - COR_E_THREADSTOP
+ - COR_E_APPDOMAINUNLOADED
+ - E_OUTOFMEMORY
+ - HRESULT_FROM_WIN32(ERROR_COMMITMENT_LIMIT)
+ - HRESULT_FROM_WIN32(ERROR_NOT_ENOUGH_MEMORY)
+ - (HRESULT)STATUS_NO_MEMORY
+ - COR_E_STACKOVERFLOW
+ - MSEE_E_ASSEMBLYLOADINPROGRESS
+
+The CLR developer with doubts about which macro to use should probably pick _RethrowTransientExceptions_.
+
+In every case, however, the developer writing an EX_END_CATCH needs to think hard about which exception should be caught, and should catch only those exceptions. And, because the macros catch everything anyway, the only way to not catch an exception is to rethrow it.
+
+If an EX_CATCH / EX_END_CATCH block has properly categorized its exceptions, and has rethrown wherever necessary, then SwallowAllExceptions is the way to tell the macros that no further rethrowing is necessary.
+
+## EX_CATCH_HRESULT
+
+Sometimes all that is needed is the HRESULT corresponding to an exception, particularly when the code is in an interface from COM. For these cases, EX_CATCH_HRESULT is simpler than writing a whole EX_CATCH block. A typical case would look like this:
+
+ HRESULT hr;
+ EX_TRY
+ // code
+ EX_CATCH_HRESULT (hr)
+
+ return hr;
+
+_However, while very tempting, it is not always correct_. The EX_CATCH_HRESULT catches all exceptions, saves the HRESULT, and swallows the exception. So, unless that exception swallowing is what the function really needs, EX_CATCH_HRESULT is not appropriate.
+
+EX_RETHROW
+----------
+
+As noted above, the exception macros catch all exceptions; the only way to catch a specific exception is to catch all, and rethrow all but the one(s) of interest. So if, after an exception has been caught, examined, possibly logged, and so forth, it turns out that it shouldn't be handled here, it may be rethrown: EX_RETHROW will re-raise the same exception.
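+
+A minimal sketch of this catch-examine-rethrow pattern (the HRESULT tested, the DoSomething function, and the handling code are all illustrative, not from the runtime):
+
+    EX_TRY
+        // Code that may throw.
+        DoSomething();
+    EX_CATCH
+        // Handle only the case we understand; rethrow everything else.
+        if (GET_EXCEPTION()->GetHR() != COR_E_FILENOTFOUND)
+        {
+            EX_RETHROW;
+        }
+        // ... compensate for the missing file here ...
+    EX_END_CATCH(SwallowAllExceptions)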
+
+Not catching an exception
+=========================
+
+It's frequently the case that a bit of code doesn't need to catch an exception, but does need to perform some sort of cleanup or compensating action. Holders are frequently just the thing for this scenario, but not always. For the times that holders aren't adequate, the CLR has two variations on a "finally" block.
+
+EX_TRY_FOR_FINALLY
+------------------
+
+When there is a need for some sort of compensating action as code exits, a finally may be appropriate. There is a set of macros to implement a try/finally in the CLR:
+
+ EX_TRY_FOR_FINALLY
+ // code
+ EX_FINALLY
+ // exit and/or backout code
+ EX_END_FINALLY
+
+**Important** : The EX_TRY_FOR_FINALLY macros are built with SEH, rather than C++ EH, and the C++ compiler doesn't allow SEH and C++ EH to be mixed in the same function. Locals with auto-destructors require C++ EH for their destructor to run. Therefore, any function with EX_TRY_FOR_FINALLY can't have EX_TRY, and can't have any local variable with an auto-destructor.
+
+EX_HOOK
+-------
+
+Frequently there is a need for compensating code, but only when an exception is thrown. For these cases, EX_HOOK is similar to EX_FINALLY, but the "hook" clause only runs when there is an exception. The exception is automatically rethrown at the end of the "hook" clause.
+
+    EX_TRY
+        // code
+    EX_HOOK
+        // code to run when an exception escapes the "code" block.
+    EX_END_HOOK
+
+This construct is somewhat better than simply EX_CATCH with EX_RETHROW, because it will rethrow a non-stack-overflow, but will catch a stack overflow exception (and unwind the stack) and then throw a new stack overflow exception.
+
+Throwing an Exception
+=====================
+
+Throwing an Exception in the CLR is generally a matter of calling
+
+ COMPlusThrow ( < args > )
+
+There are a number of overloads, but the idea is to pass the "kind" of the exception to COMPlusThrow. The list of "kinds" is generated by a set of macros operating on [Rexcep.h](https://github.com/dotnet/coreclr/blob/master/src/vm/rexcep.h), and the various "kinds" are kAmbiguousMatchException, kApplicationException, and so forth. Additional arguments (for the overloads) specify resources and substitution text. Generally, the right "kind" is selected by looking for other code that reports a similar error.
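+
+For example, a sketch (the null check and the resource name here are illustrative, not taken from the runtime):
+
+    // kArgumentNullException comes from the kinds generated from Rexcep.h.
+    if (pBuffer == NULL)
+        COMPlusThrow(kArgumentNullException, W("ArgumentNull_Buffer"));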
+
+There are some pre-defined convenience variations:
+
+COMPlusThrowOOM();
+------------------
+
+Defers to ThrowOutOfMemory(), which throws the C++ OOM exception. This will throw a pre-allocated exception, to avoid the problem of being out of memory trying to throw an out of memory exception!
+
+When getting the managed exception object for this exception, the runtime will first try to allocate a new managed object <sup>[1]</sup>, and if that fails, will return a pre-allocated, shared, global out of memory exception object.
+
+[1] After all, if it was a request for a 2GB array that failed, a simple object may be fine.
+
+COMPlusThrowHR(HRESULT theBadHR);
+---------------------------------
+
+There are a number of overloads, in case you have an IErrorInfo, etc. There is some surprisingly complicated code to figure out what kind of exception corresponds to a particular HRESULT.
+
+COMPlusThrowWin32(); / COMPlusThrowWin32(hr);
+---------------------------------------------
+
+Basically throws an HRESULT_FROM_WIN32(GetLastError()).
+
+COMPlusThrowSO();
+-----------------
+
+Throws a Stack Overflow (SO) Exception. Note that this is not a hard SO, but rather an exception we throw when proceeding might lead to a hard SO.
+
+Like OOM, this throws a pre-allocated C++ SO exception object. Unlike OOM, when retrieving the managed object, the runtime always returns the pre-allocated, shared, global stack overflow exception object.
+
+COMPlusThrowArgumentNull()
+--------------------------
+
+A helper for throwing an "argument foo must not be null" exception.
+
+COMPlusThrowArgumentOutOfRange()
+--------------------------------
+
+As it sounds.
+
+COMPlusThrowArgumentException()
+-------------------------------
+
+Yet another flavor of invalid argument exception.
+
+COMPlusThrowInvalidCastException(thFrom, thTo)
+----------------------------------------------
+
+Given type handles for the source and destination types of the attempted cast, the helper creates a nicely formatted exception message.
+
+EX_THROW
+--------
+
+This is a low-level throw construct that is not generally needed in normal code. Many of the COMPlusThrowXXX functions use EX_THROW internally, as do other specialized ThrowXXX functions. It is best to minimize direct use of EX_THROW, simply to keep the nitty-gritty details of the exception mechanism as well encapsulated as possible. But when none of the higher-level Throw functions work, it is fine to use EX_THROW.
+
+The macro takes two arguments, the type of exception to be thrown (some sub-type of the C++ Exception class), and a parenthesized list of arguments to the exception type's constructor.
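+
+For example, a sketch of throwing an HRException directly (in real code one would usually reach for COMPlusThrowHR instead):
+
+    // The second argument is the parenthesized constructor argument list.
+    EX_THROW(HRException, (E_UNEXPECTED));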
+
+Using SEH directly
+==================
+
+There are a few situations where it is appropriate to use SEH directly. In particular, SEH is the only option if some processing is needed on the first pass, that is, before the stack is unwound. The filter code in an SEH __try/__except can do anything, in addition to deciding whether to handle an exception. Debugger notification is an area that sometimes needs first pass handling.
+
+Filter code needs to be written very carefully. In general, the filter code must be prepared for any random, and likely inconsistent, state. Because the filter runs on the first pass, and destructors run on the second pass, holders' destructors won't have run yet, and so holders will not have restored their state.
+
+PAL_TRY / PAL_EXCEPT, PAL_EXCEPT_FILTER, PAL_FINALLY / PAL_ENDTRY
+-----------------------------------------------------------------
+
+When a filter is needed, the PAL_TRY family is the portable way to write one in the CLR. Because the filter uses SEH directly, it is incompatible with C++ EH in the same function, and so there can't be any holders in the function.
+
+Again, these should be rare.
+
+__try / __except, __finally
+---------------------------
+
+There isn't a good reason to use these directly in the CLR.
+
+Exceptions and GC mode
+======================
+
+Throwing an exception with COMPlusThrowXXX() doesn't affect the GC mode, and is safe in any mode. As the exception unwinds back to the EX_CATCH, any holders that were on the stack will be unwound, releasing their resources and resetting their state. By the time that execution resumes in the EX_CATCH, the holder-protected state will have been restored to what it was at the time of the EX_TRY.
+
+Transitions
+===========
+
+Considering managed code, the CLR, COM servers, and other native code, there are many possible transitions between calling conventions, memory management, and, of course, exception handling mechanisms. Regarding exceptions, it is fortunate for the CLR developer that most of these transitions are either completely outside of the runtime, or are handled automatically. There are three transitions that are a daily concern for a CLR developer. Anything else is an advanced topic, and those who need to know about them are well aware that they need to know!
+
+Managed code into the runtime
+-----------------------------
+
+This is the "fcall", "jit helper", and so forth. The typical way that the runtime reports errors back to managed code is through a managed exception. So, if an fcall function, directly or indirectly, raises a managed exception, that's perfectly fine. The normal CLR managed exception implementation will "do the right thing" and look for an appropriate managed handler.
+
+On the other hand, if an fcall function can do anything that might throw a CLR internal exception (one of the C++ exceptions), that exception must not be allowed to leak back out to managed code. To handle this case, the CLR has the UnwindAndContinueHandler (UACH), which is a set of code to catch the C++ EH exceptions, and re-raise them as managed exceptions.
+
+Any runtime function that is called from managed code, and might throw a C++ EH exception, must wrap the throwing code in INSTALL_UNWIND_AND_CONTINUE_HANDLER / UNINSTALL_UNWIND_AND_CONTINUE_HANDLER. Installing a HELPER_METHOD_FRAME will automatically install the UACH. There is a non-trivial amount of overhead to installing a UACH, so they shouldn't be used everywhere. One technique that is used in performance critical code is to run without a UACH, and install one just before throwing an exception.
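+
+A minimal sketch of the pattern (contracts and frame setup omitted; in real code a HELPER_METHOD_FRAME would typically provide this):
+
+    INSTALL_UNWIND_AND_CONTINUE_HANDLER;
+
+    // Work that may throw a C++ EH exception; the UACH will catch it at
+    // the managed boundary and re-raise it as a managed exception.
+    COMPlusThrow(kInvalidOperationException);
+
+    UNINSTALL_UNWIND_AND_CONTINUE_HANDLER;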
+
+When a C++ exception is thrown, and there is a missing UACH, the typical failure will be a Contract Violation of "GC_TRIGGERS called in a GC_NOTRIGGER region" in CPFH_RealFirstPassHandler. To fix these, look for managed to runtime transitions, and check for INSTALL_UNWIND_AND_CONTINUE_HANDLER or HELPER_METHOD_FRAME_BEGIN_XXX.
+
+Runtime code into managed code
+------------------------------
+
+The transition from the runtime into managed code has highly platform-dependent requirements. On 32-bit Windows platforms, the CLR's managed exception code requires that the "COMPlusFrameHandler" is installed just before entering managed code. These transitions are handled by highly specialized helper functions, which take care of installing the appropriate exception handlers. It is very unlikely that any new call into managed code would use any other way in. If the COMPlusFrameHandler were missing, the most likely effect would be that exception handling code in the target managed code simply wouldn't be executed: no finally blocks, and no catch blocks.
+
+Runtime code into external native code
+--------------------------------------
+
+Calls from the runtime into other native code (the OS, the CRT, and other DLLs) may need particular attention. The cases that matter are those in which the external code might cause an exception. The reason that this is a problem comes from the implementation of the EX_TRY macros, and in particular how they translate or wrap non-Exceptions into Exceptions. With C++ EH, it is possible to catch any and all exceptions (via "catch(...)"), but only by giving up all information about what has been caught. When catching an Exception*, the macros have the exception object to examine, but when catching anything else, there is nothing to examine, and the macros must guess what the actual exception is. And when the exception comes from outside of the runtime, the macros will always guess wrong.
+
+The current solution is to wrap the call to external code in a "callout filter". The filter will catch the external exception and translate it into SEHException, one of the runtime's internal exceptions. This filter is predefined and simple to use. However, using a filter means using SEH, which of course precludes using C++ EH in the same function. Adding a callout filter to a function that uses C++ EH requires splitting the function in two.
+
+To use the callout filter, instead of this:
+
+ length = SysStringLen(pBSTR);
+
+write this:
+
+ BOOL OneShot = TRUE;
+
+ PAL_TRY
+ {
+ length = SysStringLen(pBSTR);
+ }
+ PAL_EXCEPT_FILTER(CallOutFilter, &OneShot)
+ {
+ _ASSERTE(!"CallOutFilter returned EXECUTE_HANDLER.");
+ }
+ PAL_ENDTRY;
+
+A missing callout filter on a call that raises an exception will always result in the wrong exception being reported in the runtime. The type that is incorrectly reported isn't even always deterministic; if there is already some managed exception "in flight", then that managed exception is what will be reported. If there is no current exception, then OOM will be reported. On a checked build there are asserts that usually fire for a missing callout filter. These assert messages will include the text "The runtime may have lost track of the type of an exception".
+
+Miscellaneous
+=============
+
+There are actually a lot of macros involved in EX_TRY. Most of them should never, ever, be used outside of the macro implementations.
+
+One set, BEGIN_EXCEPTION_GLUE / END_EXCEPTION_GLUE, deserves special mention. These were intended to be transitional macros, and were to be replaced with more appropriate macros in the Whidbey project. Of course, they worked just fine, and so they weren't all replaced. Ideally, all instances will be converted during a "cleanup" milestone, and the macros removed. In the meantime, any CLR dev tempted to use them should resist, and instead write EX_TRY/EX_CATCH/EX_CATCH_END or EX_CATCH_HRESULT.
diff --git a/Documentation/botr/garbage-collection.md b/Documentation/botr/garbage-collection.md
new file mode 100644
index 0000000000..9e16131114
--- /dev/null
+++ b/Documentation/botr/garbage-collection.md
@@ -0,0 +1,332 @@
+Garbage Collection Design
+=========================
+Author: Maoni Stephens ([@maoni0](https://github.com/maoni0)) - 2015
+
+Note: See _The Garbage Collection Handbook_ referenced in the resources section at the end of this document to learn more about garbage collection topics.
+
+Component Architecture
+======================
+
+The GC has two components: the allocator and the
+collector. The allocator is responsible for getting more memory and triggering the collector when appropriate. The collector reclaims garbage, or the memory of objects that are no longer in use by the program.
+
+There are other ways that the collector can get called, such as manually calling GC.Collect or the finalizer thread receiving an asynchronous notification of low memory (which triggers the collector).
+
+Design of Allocator
+===================
+
+The allocator gets called by the allocation helpers in the Execution Engine (EE), with the following information:
+
+- Size requested
+- Thread allocation context
+- Flags that indicate things like whether this is a finalizable object or not
+
+The GC does not have special treatment for different kinds of object types. It consults the EE to get the size of an object.
+
+Based on the size, the GC divides objects into 2 categories: small
+objects (< 85,000 bytes) and large objects (>= 85,000 bytes). In
+principle, small and large objects can be treated the same way, but
+since compacting large objects is more expensive, the GC makes this
+distinction.
+
+When the GC gives out memory to the allocator, it does so in terms of allocation contexts. The size of an allocation context is defined by the allocation quantum.
+
+- **Allocation contexts** are smaller regions of a given heap segment that are each dedicated for use by a given thread. On a single-processor (meaning 1 logical processor) machine, a single context is used, which is the generation 0 allocation context.
+- The **Allocation quantum** is the size of memory that the allocator allocates each time it needs more memory, in order to perform object allocations within an allocation context. The allocation quantum is typically 8k, and the average size of managed objects is around 35 bytes, enabling a single allocation quantum to be used for many object allocations.
+
+Large objects do not use allocation contexts and quantums. A single large object can itself be larger than these smaller regions of memory. Also, the benefits (discussed below) of these regions are specific to smaller objects. Large objects are allocated directly to a heap segment.
+
+The allocator is designed to achieve the following:
+
+- **Triggering a GC when appropriate:** The allocator triggers a GC when the allocation budget (a threshold set by the collector) is exceeded or when the allocator can no longer allocate on a given segment. The allocation budget and managed segments are discussed in more detail later.
+- **Preserving object locality:** Objects allocated together on the same heap segment will be stored at virtual addresses close to each other.
+- **Efficient cache usage:** The allocator allocates memory in _allocation quantum_ units, not on an object-by-object basis. It zeroes out that much memory to warm up the CPU cache because there will be objects immediately allocated in that memory. The allocation quantum is usually 8k.
+- **Efficient locking:** The thread affinity of allocation contexts and quantums guarantees that there is only ever a single thread writing to a given allocation quantum. As a result, there is no need to lock for object allocations, as long as the current allocation context is not exhausted (see the sketch following this list).
+- **Memory integrity:** The GC always zeroes out the memory for newly allocated objects to prevent object references pointing at random memory.
+- **Keeping the heap crawlable:** The allocator makes sure to make a free object out of leftover memory in each allocation quantum. For example, if there are 30 bytes left in an allocation quantum and the next object is 40 bytes, the allocator will make the 30 bytes a free object and get a new allocation quantum.
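+
+A minimal sketch of the allocation fast path these goals imply, using the alloc_ptr/alloc_limit fields of the GC's allocation context but a hypothetical slow-path helper:
+
+    // Bump-pointer allocation within a thread-affine allocation context.
+    // No lock is needed on this path because only one thread ever
+    // allocates from a given allocation context.
+    Object* AllocSmall(alloc_context* acontext, size_t size)
+    {
+        uint8_t* result = acontext->alloc_ptr;
+        if (result + size <= acontext->alloc_limit)
+        {
+            acontext->alloc_ptr = result + size;
+            return (Object*)result;
+        }
+        // Context exhausted: turn any leftover bytes into a free object
+        // (keeping the heap crawlable) and get a new allocation quantum,
+        // possibly triggering a GC. AllocSlowPath is a hypothetical name.
+        return AllocSlowPath(acontext, size);
+    }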
+
+Allocation APIs
+---------------
+
+ Object* GCHeap::Alloc(size_t size, DWORD flags);
+ Object* GCHeap::Alloc(alloc_context* acontext, size_t size, DWORD flags);
+
+The above functions can be used to allocate both small objects and
+large objects. There is also a function to allocate directly on LOH:
+
+ Object* GCHeap::AllocLHeap(size_t size, DWORD flags);
+
+Design of the Collector
+=======================
+
+Goals of the GC
+---------------
+
+The GC strives to manage memory extremely efficiently and to
+require very little effort from people who write "managed code". Efficient means:
+
+- GCs should occur often enough to avoid the managed heap containing a significant amount (by ratio or absolute count) of unused but allocated objects (garbage), and therefore use memory unnecessarily.
+- GCs should happen as infrequently as possible to avoid using otherwise useful CPU time, even though frequent GCs would result in lower memory usage.
+- A GC should be productive. If a GC reclaims only a small amount of memory, then the GC (including the associated CPU cycles) was wasted.
+- Each GC should be fast. Many workloads have low latency requirements.
+- Managed code developers shouldn’t need to know much about the GC to achieve good memory utilization (relative to their workload).
+- The GC should tune itself to satisfy different memory usage patterns.
+
+Logical representation of the managed heap
+------------------------------------------
+
+The CLR GC is a generational collector, which means objects are
+logically divided into generations. When a generation _N_ is collected,
+the surviving objects are marked as belonging to generation _N+1_. This
+process is called promotion. There are exceptions to this when we
+decide to demote or not promote.
+
+For small objects the heap is divided into 3 generations: gen0, gen1
+and gen2. For large objects there’s one generation – gen3. Gen0 and gen1 are referred to as ephemeral (objects lasting for a short time) generations.
+
+For the small object heap, the generation number represents the age – gen0
+being the youngest generation. This doesn’t mean all objects in gen0
+are younger than any objects in gen1 or gen2. There are exceptions
+which will be explained below. Collecting a generation means collecting
+objects in that generation and all its younger generations.
+
+In principle large objects can be handled the same way as small
+objects but since compacting large objects is very expensive, they are treated differently. There is only one generation for large objects and
+they are always collected with gen2 collections due to performance
+reasons. Both gen2 and gen3 can be big, and collecting ephemeral generations (gen0 and gen1) needs to have a bounded cost.
+
+Allocations are made in the youngest generation – for small objects this means always gen0 and for large objects this means gen3 since there’s only one generation.
+
+Physical representation of the managed heap
+-------------------------------------------
+
+The managed heap is a set of managed heap segments. A heap segment is a contiguous block of memory that is acquired by the GC from the OS. The heap segments are
+partitioned into small and large object segments, given the distinction of small and large objects. On each heap, the heap segments are chained together. There is at least one small object segment and one large object segment; they are reserved when the CLR is loaded.
+
+There’s always only one ephemeral segment in each small object heap, which is where gen0 and gen1 live. This segment may or may not include gen2
+objects. In addition to the ephemeral segment, there can be zero, one or more additional segments, which will be gen2 segments since they only contain gen2 objects.
+
+The large object heap has one or more segments.
+
+A heap segment is consumed from the lower address to the higher
+address, which means objects of lower addresses on the segment are
+older than those of higher addresses. Again there are exceptions that
+will be described below.
+
+Heap segments can be acquired as needed. They are deleted when they
+don't contain any live objects; however, the initial segment on the heap
+will always exist. For each heap, one segment at a time is acquired,
+which is done during a GC for small objects and at allocation time
+for large objects. This design provides better performance because large objects are only collected with gen2 collections (which are relatively expensive).
+
+Heap segments are chained together in order of when they were acquired. The last segment in the chain is always the ephemeral segment. Collected segments (with no live objects) can be reused instead of deleted and instead become the new ephemeral segment. Segment reuse is only implemented for the small object heap. Each time a large object is allocated, the whole large object heap is considered. Small object allocations only consider the ephemeral segment.
+
+The allocation budget
+---------------------
+
+The allocation budget is a logical concept associated with each
+generation. It is a size limit that triggers a GC for that
+generation when it is exceeded.
+
+The budget is a property set on the generation mostly based on the
+survival rate of that generation. If the survival rate is high, the budget is made larger with the expectation that there will be a better ratio of dead to live objects next time there is a GC for that generation.
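+
+Conceptually (with hypothetical names), the check made on the allocation path is:
+
+    // Trigger a GC of a generation when the bytes allocated into it
+    // since its last GC exceed its budget; after the GC, the budget is
+    // retuned, growing when the generation's survival rate is high.
+    if (gen->allocated_since_last_gc > gen->allocation_budget)
+    {
+        garbage_collect(gen);
+    }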
+
+Determining which generation to collect
+---------------------------------------
+
+When a GC is triggered, the GC must first determine which generation to collect. Besides the allocation budget there are other factors that must be considered:
+
+- Fragmentation of a generation – if a generation has high fragmentation, collecting that generation is likely to be productive.
+- If the memory load on the machine is too high, the GC may collect
+ more aggressively if that’s likely to yield free space. This is important to
+ prevent unnecessary paging (across the machine).
+- If the ephemeral segment is running out of space, the GC may do more aggressive ephemeral collections (meaning doing more gen1’s) to avoid acquiring a new heap segment.
+
+The flow of a GC
+----------------
+
+Mark phase
+----------
+
+The goal of the mark phase is to find all live objects.
+
+The benefit of a generational collector is the ability to collect just part of
+the heap instead of having to look at all of the objects all the
+time. When collecting the ephemeral generations, the GC needs to find out which objects are live in these generations, which is information reported by the EE. Besides the objects kept live by the EE, objects in older generations
+can also keep objects in younger generations live by making references
+to them.
+
+The GC uses cards to find references from older generations into
+younger ones. Cards are set by JIT helpers during assignment
+operations. If the JIT helper sees an object in the ephemeral range
+being stored, it will set the byte that contains the card representing
+the source location. During ephemeral collections, the GC can look at the set cards for the rest of the heap and only look at the objects that these cards correspond to.
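+
+A conceptual sketch of the card-setting write barrier (simplified; the global names are modeled on the ones in gc.cpp, and the real JIT helpers are more involved):
+
+    // For "obj->field = child": if the stored reference points into the
+    // ephemeral range, set the byte containing the card that covers the
+    // source location, so an ephemeral GC knows to scan it.
+    void WriteBarrier(Object** field, Object* child)
+    {
+        *field = child;
+        if ((uint8_t*)child >= g_ephemeral_low && (uint8_t*)child < g_ephemeral_high)
+        {
+            g_card_table[((size_t)field) >> card_byte_shift] = 0xFF;
+        }
+    }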
+
+Plan phase
+---------
+
+The plan phase simulates a compaction to determine the effective result. If compaction is productive, the GC starts an actual compaction; otherwise, it sweeps.
+
+Relocate phase
+--------------
+
+If the GC decides to compact, which will result in moving objects, then references to these objects must be updated. The relocate phase needs to find all references that point to objects that are in the
+generations being collected. In contrast, the mark phase only consults live objects, so it doesn't need to consider weak references.
+
+Compact phase
+-------------
+
+This phase is straightforward since the plan phase has already
+calculated the new addresses the objects should move to. The compact
+phase copies the objects there.
+
+Sweep phase
+-----------
+
+The sweep phase looks for the dead space in between live objects. It creates free objects in place of these dead spaces. Adjacent dead objects are made into one free object. It places all of these free objects onto the _freelist_.
+
+Code Flow
+=========
+
+Terms:
+
+- **WKS GC:** Workstation GC.
+- **SVR GC:** Server GC.
+
+Functional Behavior
+-------------------
+
+### WKS GC with concurrent GC off
+
+1. User thread runs out of allocation budget and triggers a GC.
+2. GC calls SuspendEE to suspend managed threads.
+3. GC decides which generation to condemn.
+4. Mark phase runs.
+5. Plan phase runs and decides if a compacting GC should be done.
+6. If so, the relocate and compact phases run. Otherwise, the sweep phase runs.
+7. GC calls RestartEE to resume managed threads.
+8. User thread resumes running.
+
+### WKS GC with concurrent GC on
+
+This illustrates how a background GC is done.
+
+1. User thread runs out of allocation budget and triggers a GC.
+2. GC calls SuspendEE to suspend managed threads.
+3. GC decides if background GC should be run.
+4. If so, the background GC thread is woken up to do a background
+   GC. The background GC thread calls RestartEE to resume managed threads.
+5. Managed threads continue allocating while the background GC does its work.
+6. User thread may run out of allocation budget and trigger an
+ ephemeral GC (what we call a foreground GC). This is done in the same
+ fashion as the "WKS GC with concurrent GC off" flavor.
+7. Background GC calls SuspendEE again to finish with marking and then
+ calls RestartEE to start the concurrent sweep phase while user threads
+ are running.
+8. Background GC is finished.
+
+### SVR GC with concurrent GC off
+
+1. User thread runs out of allocation budget and triggers a GC.
+2. Server GC threads are woken up and call SuspendEE to suspend
+   managed threads.
+3. Server GC threads do the GC work (same phases as in workstation GC
+ without concurrent GC).
+4. Server GC threads call RestartEE to resume managed threads.
+5. User thread resumes running.
+
+### SVR GC with concurrent GC on
+
+This scenario is the same as WKS GC with concurrent GC on, except the non-background GCs are done on SVR GC threads.
+
+Physical Architecture
+=====================
+
+This section is meant to help you follow the code flow.
+
+User thread runs out of quantum and gets a new quantum via try_allocate_more_space.
+
+try_allocate_more_space calls GarbageCollectGeneration when it needs to trigger a GC.
+
+Given WKS GC with concurrent GC off, GarbageCollectGeneration is done entirely
+on the user thread that triggered the GC. The code flow is:
+
+ GarbageCollectGeneration()
+ {
+ SuspendEE();
+ garbage_collect();
+ RestartEE();
+ }
+
+ garbage_collect()
+ {
+ generation_to_condemn();
+ gc1();
+ }
+
+ gc1()
+ {
+ mark_phase();
+ plan_phase();
+ }
+
+ plan_phase()
+ {
+ // actual plan phase work to decide to
+ // compact or not
+ if (compact)
+ {
+ relocate_phase();
+ compact_phase();
+ }
+ else
+ make_free_lists();
+ }
+
+Given WKS GC with concurrent GC on (the default case), the code flow for a background GC is:
+
+ GarbageCollectGeneration()
+ {
+ SuspendEE();
+ garbage_collect();
+ RestartEE();
+ }
+
+ garbage_collect()
+ {
+ generation_to_condemn();
+ // decide to do a background GC
+ // wake up the background GC thread to do the work
+ do_background_gc();
+ }
+
+ do_background_gc()
+ {
+ init_background_gc();
+ start_c_gc ();
+
+ //wait until restarted by the BGC.
+ wait_to_proceed();
+ }
+
+ bgc_thread_function()
+ {
+ while (1)
+ {
+ // wait on an event
+ // wake up
+ gc1();
+ }
+ }
+
+ gc1()
+ {
+ background_mark_phase();
+ background_sweep();
+ }
+
+Resources
+=========
+
+- [.NET CLR GC Implementation](https://raw.githubusercontent.com/dotnet/coreclr/master/src/gc/gc.cpp)
+- [The Garbage Collection Handbook: The Art of Automatic Memory Management](http://www.amazon.com/Garbage-Collection-Handbook-Management-Algorithms/dp/1420082795)
+- [Garbage collection (Wikipedia)](http://en.wikipedia.org/wiki/Garbage_collection_(computer_science))
diff --git a/Documentation/botr/intro-to-clr.md b/Documentation/botr/intro-to-clr.md
new file mode 100644
index 0000000000..3ba3d3b81e
--- /dev/null
+++ b/Documentation/botr/intro-to-clr.md
@@ -0,0 +1,261 @@
+Introduction to the Common Language Runtime (CLR)
+===
+
+By Vance Morrison ([@vancem](https://github.com/vancem)) - 2007
+
+What is the Common Language Runtime (CLR)? To put it succinctly:
+
+> The Common Language Runtime (CLR) is a complete, high level virtual machine designed to support a broad variety of programming languages and interoperation among them.
+
+Phew, that was a mouthful. It also in and of itself is not very illuminating. The statement above _is_ useful however, because it is the first step in taking the large and complicated piece of software known as the [CLR][clr] and grouping its features in an understandable way. It gives us a "10,000 foot" view of the runtime from which we can understand the broad goals and purpose of the runtime. After understanding the CLR at this high level, it is easier to look more deeply into sub-components without as much chance of getting lost in the details.
+
+# The CLR: A (very rare) Complete Programming Platform
+
+Every program has a surprising number of dependencies on its runtime environment. Most obviously, the program is written in a particular programming language, but that is only the first of many assumptions a programmer weaves into the program. All interesting programs need some _runtime library_ that allows them to interact with the other resources of the machine (such as user input, disk files, network communications, etc). The program also needs to be converted in some way (either by interpretation or compilation) to a form that the native hardware can execute directly. These dependencies of a program are so numerous, interdependent and diverse that implementers of programming languages almost always defer to other standards to specify them. For example, the C++ language does not specify the format of a C++ executable. Instead, each C++ compiler is bound to a particular hardware architecture (e.g., X86) and to an operating system environment (e.g., Windows, Linux, or Mac OS), which defines the executable file format and specifies how it will be loaded. Thus, programmers don't make a "C++ executable," but rather a "Windows X86 executable" or a "Power PC Mac OS executable."
+
+While leveraging existing hardware and operating system standards is usually a good thing, it has the disadvantage of tying the specification to the level of abstraction of the existing standards. For example, no common operating system today has the concept of a garbage-collected heap. Thus, there is no way to use existing standards to describe an interface that takes advantage of garbage collection (e.g., passing strings back and forth, without worrying about who is responsible for deleting them). Similarly, a typical executable file format provides just enough information to run a program but not enough information for a compiler to bind other binaries to the executable. For example, C++ programs typically use a standard library (on Windows, called msvcrt.dll) which contains most of the common functionality (e.g., printf), but the existence of that library alone is not enough. Without the matching header files that go along with it (e.g., stdio.h), programmers can't use the library. Thus, existing executable file format standards cannot be used both to describe a file format that can be run and to specify other information or binaries necessary to make the program complete.
+
+The CLR fixes problems like these by defining a [very complete specification][ecma-spec] (standardized by ECMA) containing the details you need for the COMPLETE lifecycle of a program, from construction and binding through deployment and execution. Thus, among other things, the CLR specifies:
+
+- A GC-aware virtual machine with its own instruction set (called the Common Intermediate Language (CIL)) used to specify the primitive operations that programs perform. This means the CLR is not dependent on a particular type of CPU.
+- A rich meta data representation for program declarations (e.g., types, fields, methods, etc), so that compilers generating other executables have the information they need to call functionality from 'outside'.
+- A file format that specifies exactly how to lay the bits down in a file, so that you can properly speak of a CLR EXE that is not tied to a particular operating system or computer hardware.
+- The lifetime semantics of a loaded program, the mechanism by which one CLR EXE file can refer to another CLR EXE and the rules on how the runtime finds the referenced files at execution time.
+- A class library that leverages the features that the CLR provides (e.g., garbage collection, exceptions, or generic types) to give access both to basic functionality (e.g., integers, strings, arrays, lists, or dictionaries) as well as to operating system services (e.g., files, network, or user interaction).
+
+Multi-language Support
+----------------------
+
+Defining, specifying and implementing all of these details is a huge undertaking, which is why complete abstractions like the CLR are very rare. In fact, the vast majority of such reasonably complete abstractions were built for single languages. For example, the Java runtime, the Perl interpreter and the early versions of the Visual Basic runtime offer similarly complete abstraction boundaries. What distinguishes the CLR from these earlier efforts is its multi-language nature. Within a single-language runtime the experience is often very good, but interoperating with programs written in other languages is difficult at best (Visual Basic is a possible exception because it leverages the COM object model). Interoperation is difficult because these languages can only communicate with "foreign" languages by using the primitives provided by the operating system. Because the OS abstraction level is so low (e.g., the operating system has no concept of a garbage-collected heap), needlessly complicated techniques are necessary. By providing a COMMON LANGUAGE RUNTIME, the CLR allows languages to communicate with each other with high-level constructs (e.g., GC-collected structures), easing the interoperation burden dramatically.
+
+Because the runtime is shared among _many_ languages, it means that more resources can be put into supporting it well. Building good debuggers and profilers for a language is a lot of work, and thus they exist in a full-featured form only for the most important programming languages. Nevertheless, because languages that are implemented on the CLR can reuse this infrastructure, the burden on any particular language is reduced substantially. Perhaps even more important, any language built on the CLR immediately has access to _all_ the class libraries built on top of the CLR. This large (and growing) body of (debugged and supported) functionality is a huge reason why the CLR has been so successful.
+
+In short, the runtime is a complete specification of the exact bits one has to put in a file to create and run a program. The virtual machine that runs these files is at a high level appropriate for implementing a broad class of programming languages. This virtual machine, along with an ever growing body of class libraries that run on that virtual machine, is what we call the common language runtime (CLR).
+
+# The Primary Goal of the CLR
+
+Now that we have a basic idea of what the CLR is, it is useful to back up just a bit and understand the problem the runtime was meant to solve. At a very high level, the runtime has only one goal:
+
+> The goal of the CLR is to make programming easy.
+
+This statement is useful for two reasons. First, it is a _very_ useful guiding principle as the runtime evolves. For example, fundamentally only simple things can be easy, so adding **user visible** complexity to the runtime should always be viewed with suspicion. More important than the cost/benefit ratio of a feature is its _added exposed complexity/weighted benefit over all scenarios_ ratio. Ideally, this ratio is negative (that is, the new feature reduces complexity by removing restrictions or by generalizing existing special cases); however, more typically it is kept low by minimizing the exposed complexity and maximizing the number of scenarios to which the feature adds value.
+
+The second reason this goal is so important is that **ease of use is the fundamental reason for the CLR's success**. The CLR is not successful because it is faster or smaller than writing native code (in fact, well-written native code often wins). The CLR is not successful because of any particular feature it supports (like garbage collection, platform independence, object-oriented programming or versioning support). The CLR is successful because all of those features, as well as numerous others, combine to make programming significantly easier than it would be otherwise. Some important but often overlooked ease of use features include:
+
+1. Simplified languages (e.g., C# and Visual Basic are significantly simpler than C++)
+2. A dedication to simplicity in the class library (e.g., we only have one string type, and it is immutable; this greatly simplifies any API that uses strings)
+3. Strong consistency in the naming in the class library (e.g., requiring APIs to use whole words and consistent naming conventions)
+4. Great support in the tool chain needed to create an application (e.g., Visual Studio makes building CLR applications very simple, and Intellisense makes finding the right types and methods to create the application very easy).
+
+It is this dedication to ease of use (which goes hand in hand with simplicity of the user model) that stands out as the reason for the success of the CLR. Oddly, some of the most important ease-of-use features are also the most "boring." For example, any programming environment could apply consistent naming conventions, yet actually doing so across a large class library is quite a lot of work. Often such efforts conflict with other goals (such as retaining compatibility with existing interfaces), or they run into significant logistical concerns (such as the cost of renaming a method across a _very_ large code base). It is at times like these that we have to remind ourselves of the number-one overarching goal of the runtime and ensure that we have our priorities straight to reach that goal.
+
+# Fundamental Features of the CLR
+
+The runtime has many features, so it is useful to categorize them as follows:
+
+1. Fundamental features – Features that have broad impact on the design of other features. These include:
+ a. Garbage Collection
+ b. Memory Safety and Type Safety
+ c. High level support for programming languages.
+
+2. Secondary features – Features enabled by the fundamental features that may not be required by many useful programs:
+ a. Program isolation with AppDomains
+ b. Program Security and sandboxing
+
+3. Other Features – Features that all runtime environments need but that do not leverage the fundamental features of the CLR. Instead, they are the result of the desire to create a complete programming environment. Among them are:
+ a. Versioning
+ b. Debugging/Profiling
+ c. Interoperation
+
+## The CLR Garbage Collector (GC)
+
+Of all the features that the CLR provides, the garbage collector deserves special notice. Garbage collection (GC) is the common term for automatic memory reclamation. In a garbage-collected system, user programs no longer need to invoke a special operator to delete memory. Instead the runtime automatically keeps track of all references to memory in the garbage-collected heap, and from time-to-time, it will traverse these references to find out which memory is still reachable by the program. All other memory is _garbage_ and can be reused for new allocations.
+
+Garbage collection is a wonderful user feature because it simplifies programming. The most obvious simplification is that most explicit delete operations are no longer necessary. While removing the delete operations is important, the real value to the programmer is a bit more subtle:
+
+1. Garbage collection simplifies interface design because you no longer have to carefully specify which side of the interface is responsible for deleting objects passed across the interface. For example, CLR interfaces simply return strings; they don't take string buffers and lengths. This means they don't have to deal with the complexity of what happens when the buffers are too small. Thus, garbage collection allows ALL interfaces in the runtime to be simpler than they otherwise would be.
+2. Garbage collection eliminates a whole class of common user mistakes. It is frightfully easy to make mistakes concerning the lifetime of a particular object, either deleting it too soon (leading to memory corruption), or too late (unreachable memory leaks). Since a typical program uses literally MILLIONS of objects, the probability for error is quite high. In addition, tracking down lifetime bugs is very difficult, especially if the object is referenced by many other objects. Making this class of mistakes impossible avoids a lot of grief.
+
+Still, it is not the usefulness of garbage collection that makes it worthy of special note here. More important is the simple requirement it places on the runtime itself:
+
+> Garbage collection requires ALL references to the GC heap to be tracked.
+
+While this is a very simple requirement, it in fact has profound ramifications for the runtime. As you can imagine, knowing where every pointer to an object is at every moment of program execution can be quite difficult. We have one mitigating factor, though. Technically, this requirement only applies to when a GC actually needs to happen (thus, in theory we don't need to know where all GC references are all the time, but only at the time of a GC). In practice, however, this mitigation doesn't completely apply because of another feature of the CLR:
+
+> The CLR supports multiple concurrent threads of execution with a single process.
+
+At any time some other thread of execution might perform an allocation that requires a garbage collection. The exact sequence of operations across concurrently executing threads is non-deterministic. We can't tell exactly what one thread will be doing when another thread requests an allocation that will trigger a GC. Thus, GCs can really happen any time. Now the CLR does NOT need to respond _immediately_ to another thread's desire to do a GC, so the CLR has a little "wiggle room" and doesn't need to track GC references at _all_ points of execution, but it _does_ need to do so at enough places that it can guarantee "timely" response to the need to do a GC caused by an allocation on another thread.
+
+What this means is that the CLR needs to track _all_ references to the GC heap _almost_ all the time. Since GC references may reside in machine registers, in local variables, statics, or other fields, there is quite a bit to track. The most problematic of these locations are machine registers and local variables because they are so intimately related to the actual execution of user code. Effectively, what this means is that the _machine code_ that manipulates GC references has another requirement: it must track all the GC references that it uses. This implies some extra work for the compiler to emit the instructions to track the references.
+
+To learn more, check out the [Garbage Collector design document](garbage-collection.md).
+
+## The Concept of "Managed Code"
+
+Code that does the extra bookkeeping so that it can report all of its live GC references "almost all the time" is called _managed code_ (because it is "managed" by the CLR). Code that does not do this is called _unmanaged code_. Thus all code that existed before the CLR is unmanaged code, and in particular, all operating system code is unmanaged.
+
+### The stack unwinding problem
+
+Clearly, because managed code needs the services of the operating system, there will be times when managed code calls unmanaged code. Similarly, because the operating system originally started the managed code, there are also times when unmanaged code calls into managed code. Thus, in general, if you stop a managed program at an arbitrary location, the call stack will have a mixture of frames created by managed code and frames created by unmanaged code.
+
+The stack frames for unmanaged code have _no_ requirements on them over and above running the program. In particular, there is no requirement that they can be _unwound_ at runtime to find their caller. What this means is that if you stop a program at an arbitrary place, and it happens to be in an unmanaged method, there is no way in general<sup>[1]</sup> to find who the caller was. You can only do this in the debugger because of extra information stored in the symbolic information (PDB file). This information is not guaranteed to be available (which is why you sometimes don't get good stack traces in a debugger). This is quite problematic for managed code, because any stack that can't be unwound might in fact contain managed code frames (which contain GC references that need to be reported).
+
+Managed code has additional requirements on it: not only must it track all the GC references it uses during its execution, but it must also be able to unwind to its caller. Additionally, whenever there is a transition from managed code to unmanaged code (or the reverse), managed code must also do additional bookkeeping to make up for the fact that unmanaged code does not know how to unwind its stack frames. Effectively, managed code links together the parts of the stack that contain managed frames. Thus, while it still may be impossible to unwind the unmanaged stack frames without additional information, it will always be possible to find the chunks of the stack that correspond to managed code and to enumerate the managed frames in those chunks.
+
+[1] More recent platform ABIs (application binary interfaces) define conventions for encoding this information, however there is typically not a strict requirement for all code to follow them.
+
+### The "World" of Managed Code
+
+The result is that special bookkeeping is needed at every transition to and from managed code. Managed code effectively lives in its own "world" where execution can't enter or leave unless the CLR knows about it. The two worlds are in a very real sense distinct from one another (at any point in time the code is in the _managed world_ or the _unmanaged world_). Moreover, because the execution of managed code is specified in a CLR format (with its [Common Intermediate Language][cil-spec] (CIL)), and it is the CLR that converts it to run on the native hardware, the CLR has _much_ more control over exactly what that execution does. For example, the CLR could change the meaning of what it means to fetch a field from an object or call a function. In fact, the CLR does exactly this to support the ability to create MarshalByReference objects. These appear to be ordinary local objects, but in fact may exist on another machine. In short, the managed world of the CLR has a large number of _execution hooks_ that it can use to support powerful features which will be explained in more detail in the coming sections.
+
+In addition, there is another important ramification of managed code that may not be so obvious. In the unmanaged world, GC pointers are not allowed (since they can't be tracked), and there is a bookkeeping cost associated with transitioning from managed to unmanaged code. What this means is that while you _can_ call arbitrary unmanaged functions from managed code, it is often not pleasant to do so. Unmanaged methods don't use GC objects in their arguments and return types, which means that any "objects" or "object handles" that those unmanaged functions create and use need to be explicitly deallocated. This is quite unfortunate. Because these APIs can't take advantage of CLR functionality such as exceptions or inheritance, they tend to have a "mismatched" user experience compared to how the interfaces would have been designed in managed code.
+
+The result of this is that unmanaged interfaces are almost always _wrapped_ before being exposed to managed code developers. For example, when accessing files, you don't use the Win32 CreateFile functions provided by the operating system, but rather the managed System.IO.File class that wraps this functionality. It is in fact extremely rare that unmanaged functionality is exposed to users directly.
+
+While this wrapping may seem to be "bad" in some way (more code that does not seem to do much), it is in fact good because it actually adds quite a bit of value. Remember it was always _possible_ to expose the unmanaged interfaces directly; we _chose_ to wrap the functionality. Why? Because the overarching goal of the runtime is to **make programming easy**, and typically the unmanaged functions are not easy enough. Most often, unmanaged interfaces are _not_ designed with ease of use in mind, but rather are tuned for completeness. Anyone looking at the arguments to CreateFile or CreateProcess would be hard pressed to characterize them as "easy." Luckily, the functionality gets a "facelift" when it enters the managed world, and while this makeover is often very "low tech" (requiring nothing more complex than renaming, simplification, and organizing the functionality), it is also profoundly useful. One of the very important documents created for the CLR is the [Framework Design Guidelines][fx-design-guidelines]. This 800+ page document details best practices in making new managed class libraries.
+
+Thus, we have now seen that managed code (which is intimately involved with the CLR) differs from unmanaged code in two important ways:
+
+1. High Tech: The code lives in a distinct world, where the CLR controls most aspects of program execution at a very fine level (potentially to individual instructions), and the CLR detects when execution enters and exits managed code. This enables a wide variety of useful features.
+2. Low Tech: The fact that there is a transition cost when going from managed to unmanaged code, as well as the fact that unmanaged code cannot use GC objects encourages the practice of wrapping most unmanaged code in a managed façade. This means interfaces can get a "facelift" to simplify them and to conform to a uniform set of naming and design guidelines that produce a level of consistency and discoverability that could have existed in the unmanaged world, but does not.
+
+**Both** of these characteristics are very important to the success of managed code.
+
+## Memory and Type Safety
+
+One of the less obvious but quite far-reaching features that a garbage collector enables is that of memory safety. The invariant of memory safety is very simple: a program is memory safe if it accesses only memory that has been allocated (and not freed). This simply means that you don't have "wild" (dangling) pointers that are pointing at random locations (more precisely, at memory that was freed prematurely). Clearly, memory safety is a property we want all programs to have. Dangling pointers are always bugs, and tracking them down is often quite difficult.
+
+> A GC _is_ necessary to provide memory safety guarantees
+
+One can quickly see how a garbage collector helps in ensuring memory safety because it removes the possibility that users will prematurely free memory (and thus access memory that was not properly allocated). What may not be so obvious is that if you want to guarantee memory safety (that is make it _impossible_ for programmers to create memory-unsafe programs), practically speaking you can't avoid having a garbage collector. The reason for this is that non-trivial programs need _heap style_ (dynamic) memory allocations, where the lifetime of the objects is essentially under arbitrary program control (unlike stack-allocated, or statically-allocated memory, which has a highly constrained allocation protocol). In such an unconstrained environment, the problem of determining whether a particular explicit delete statement is correct becomes impossible to predict by program analysis. Effectively, the only way you have to determine if a delete is correct is to check it at runtime. This is exactly what a GC does (checks to see if memory is still live). Thus, for any programs that need heap-style memory allocations, if you want to guarantee memory safety, you _need_ a GC.
+
+While a GC is necessary to ensure memory safety, it is not sufficient. The GC will not prevent the program from indexing off the end of an array or accessing a field off the end of an object (possible if you compute the field's address using a base and offset computation). However, if we do prevent these cases, then we can indeed make it impossible for a programmer to create memory-unsafe programs.
+
+While the [common intermediate language][cil-spec] (CIL) _does_ have operators that can fetch and set arbitrary memory (and thus violate memory safety), it also has the following memory-safe operators and the CLR strongly encourages their use in most programming:
+
+1. Field-fetch operators (LDFLD, STFLD, LDFLDA) that fetch (read), set and take the address of a field by name.
+2. Array-fetch operators (LDELEM, STELEM, LDELEMA) that fetch, set and take the address of an array element by index. All arrays include a tag specifying their length. This facilitates an automatic bounds check before each access (sketched below).
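+
+A conceptual sketch of that bounds check (names hypothetical):
+
+    // Every CLR array carries its length, so the runtime can validate
+    // the index before computing the element address.
+    int32_t LoadElement(int32_t* elements, size_t length, size_t index)
+    {
+        if (index >= length)
+            ThrowIndexOutOfRangeException();   // hypothetical helper
+        return elements[index];
+    }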
+
+By using these operators instead of the lower-level (and unsafe) _memory-fetch_ operators in user code, as well as avoiding other unsafe [CIL][cil-spec] operators (e.g., those that allow you to jump to arbitrary, and thus possibly bad locations) one could imagine building a system that is memory-safe but nothing more. The CLR does not do this, however. Instead the CLR enforces a stronger invariant: type safety.
+
+For type safety, conceptually each memory allocation is associated with a type. All operators that act on memory locations are also conceptually tagged with the type for which they are valid. Type safety then requires that memory tagged with a particular type can only undergo operations allowed for that type. Not only does this ensure memory safety (no dangling pointers), it also allows additional guarantees for each individual type.
+
+One of the most important of these type-specific guarantees is that the visibility attributes associated with a type (and in particular with fields) are enforced. Thus, if a field is declared to be private (accessible only by the methods of the type), then that privacy will indeed be respected by all other type-safe code. For example, a particular type might declare a count field that represents the count of items in a table. Assuming the fields for the count and the table are private, and assuming that the only code that updates them updates them together, there is now a strong guarantee (across all type-safe code) that the count and the number of items in the table are indeed in sync. When reasoning about programs, programmers use the concept of type safety all the time, whether they know it or not. The CLR elevates type-safety from being simply a programming language/compiler convention, to something that can be strictly enforced at run time.
+
+### Verifiable Code - Enforcing Memory and Type Safety
+
+Conceptually, to enforce type safety, every operation that the program performs has to be checked to ensure that it is operating on memory that was typed in a way that is compatible with the operation. While the system could do this all at runtime, it would be very slow. Instead, the CLR has the concept of [CIL][cil-spec] verification, where a static analysis is done on the [CIL][cil-spec] (before the code is run) to confirm that most operations are indeed type-safe. Only when this static analysis can't do a complete job are runtime checks necessary. In practice, the number of run-time checks needed is actually very small. They include the following operations:
+
+1. Casting a pointer to a base type to be a pointer to a derived type (the opposite direction can be checked statically)
+2. Array bounds checks (just as we saw for memory safety)
+3. Assigning an element in an array of pointers to a new (pointer) value. This particular check is only required because CLR arrays have liberal casting rules (more on that later...)
+
+Note that the need to do these checks places requirements on the runtime. In particular:
+
+1. All memory in the GC heap must be tagged with its type (so the casting operator can be implemented). This type information must be available at runtime, and it must be rich enough to determine if casts are valid (e.g., the runtime needs to know the inheritance hierarchy). In fact, the first field in every object on the GC heap points to a runtime data structure that represents its type.
+2. All arrays must also have their size (for bounds checking).
+3. Arrays must have complete type information about their element type.
+
+Luckily, the most expensive requirement (tagging each heap item) was something that was already necessary to support garbage collection (the GC needs to know what fields in every object contain references that need to be scanned), so the additional cost to provide type safety is low.
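+
+Concretely, the tag is a pointer at the start of each object. A simplified sketch (CoreCLR's actual type descriptor is the MethodTable, but the layout shown here is schematic):
+
+    // Every object on the GC heap begins with a pointer to a runtime
+    // data structure describing its type; the instance fields follow.
+    struct ObjectSketch
+    {
+        MethodTable* m_pMethTab;   // runtime type information
+        // ... instance fields ...
+    };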
+
+Thus, by verifying the [CIL][cil-spec] of the code and by doing a few run-time checks, the CLR can ensure type safety (and memory safety). Nevertheless, this extra safety exacts a price in programming flexibility. While the CLR does have general memory fetch operators, these operators can only be used in very constrained ways for the code to be verifiable. In particular, all pointer arithmetic will fail verification today. Thus many classic C or C++ conventions cannot be used in verifiable code; you must use arrays instead. While this constrains programming a bit, it really is not bad (arrays are quite powerful), and the benefits (far fewer "nasty" bugs), are quite real.
+
+The CLR strongly encourages the use of verifiable, type-safe code. Even so, there are times (mostly when dealing with unmanaged code) that unverifiable programming is needed. The CLR allows this, but the best practice here is to try to confine this unsafe code as much as possible. Typical programs have only a very small fraction of their code that needs to be unsafe, and the rest can be type-safe.
+
+## High Level Features
+
+Supporting garbage collection had a profound effect on the runtime because it requires that all code must support extra bookkeeping. The desire for type-safety also had a profound effect, requiring that the description of the program (the [CIL][cil-spec]) be at a high level, where fields and methods have detailed type information. The desire for type safety also forces the [CIL][cil-spec] to support other high-level programming constructs that are type-safe. Expressing these constructs in a type-safe manner also requires runtime support. The two most important of these high-level features are used to support two essential elements of object oriented programming: inheritance and virtual call dispatch.
+
+### Object Oriented Programming
+
+Inheritance is relatively simple in a mechanical sense. The basic idea is that if the fields of type `derived` are a superset of the fields of type `base`, and `derived` lays out its fields so the fields of `base` come first, then any code that expects a pointer to an instance of `base` can be given a pointer to an instance of `derived` and the code will "just work". Thus, type `derived` is said to inherit from `base`, meaning that it can be used anywhere `base` can be used. Code becomes _polymorphic_ because the same code can be used on many distinct types. Because the runtime needs to know what type coercions are possible, the runtime must formalize the way inheritance is specified so it can validate type safety.
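+
+A minimal sketch of the layout rule described above:
+
+    // Derived lays out Base's fields first, so code expecting a Base*
+    // works unchanged when handed a Derived*.
+    struct Base    { int32_t a; };
+    struct Derived : Base { int32_t b; };
+
+    int32_t ReadA(Base* p) { return p->a; }   // also valid for a Derived*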
+
+Virtual call dispatch generalizes inheritance polymorphism. It allows base types to declare methods that will be _overridden_ by derived types. Code that uses variables of type `base` can expect that calls to virtual methods will be dispatched to the correct overridden method based on the actual type of the object at run time. While such _run-time dispatch logic_ could have been implemented using primitive [CIL][cil-spec] instructions without direct support in the runtime, it would have suffered from two important disadvantages:
+
+1. It would not be type safe (mistakes in the dispatch table are catastrophic errors)
+2. Each object-oriented language would likely implement its virtual dispatch logic in a slightly different way. As a result, interoperability among languages would suffer (one language could not inherit from a base type implemented in another language).
+
+For this reason, the CLR has direct support for basic object-oriented features. To the degree possible, the CLR tried to make its model of inheritance "language neutral," in the sense that different languages might still share the same inheritance hierarchy. Unfortunately, that was not always possible. In particular, multiple inheritance can be implemented in many different ways. The CLR chose not to support multiple inheritance on types with fields, but does support multiple inheritance from special types (called interfaces) that are constrained not to have fields.
+
+It is important to keep in mind that while the runtime supports these object-oriented concepts, it does not require their use. Languages without the concept of inheritance (e.g., functional languages) simply don't use these facilities.
+
+### Value Types (and Boxing)
+
+A profound, yet subtle aspect of object oriented programming is the concept of object identity: the notion that objects (allocated by separate allocation calls) can be distinguished, even if all their field values are identical. Object identity is strongly related to the fact that objects are accessed by reference (pointer) rather than by value. If two variables hold the same object (their pointers address the same memory), then updates to one of the variables will affect the other variable.
+
+Unfortunately, the concept of object identity is not a good semantic match for all types. In particular, programmers don't generally think of integers as objects. If the number '1' was allocated at two different places, programmers generally want to consider those two items equal, and certainly don't want updates to one of those instances affecting the other. In fact, a broad class of programming languages called "functional languages" avoids object identity and reference semantics altogether.
+
+While it is possible to have a "pure" object oriented system, where everything (including integers) is an object (Smalltalk-80 does this), a certain amount of implementation "gymnastics" is necessary to undo this uniformity to get an efficient implementation. Other languages (Perl, Java, JavaScript) take a pragmatic view and treat some types (like integers) by value, and others by reference. The CLR also chose a mixed model, but unlike the others, allowed user-defined value types.
+
+The key characteristics of value types are:
+
+1. Each local variable, field, or array element of a value type has a distinct copy of the data in the value.
+2. When one variable, field or array element is assigned to another, the value is copied.
+3. Equality is always defined only in terms of the data in the variable (not its location).
+4. Each value type also has a corresponding reference type which has only one implicit, unnamed field. This is called its boxed value. Boxed value types can participate in inheritance and have object identity (although using the object identity of a boxed value type is strongly discouraged).
+
+Value types very closely model the C (and C++) notion of a struct (or C++ class). Like C you can have pointers to value types, but the pointers are a type distinct from the type of the struct.
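+
+The copy semantics can be illustrated with the C struct analogy the text draws:
+
+    // Assignment copies the data; each variable holds a distinct copy.
+    struct Point { int32_t x; int32_t y; };
+
+    Point a = { 1, 2 };
+    Point b = a;     // copies the value
+    b.x = 42;        // does not affect a.x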
+
+### Exceptions
+
+Another high-level programming construct that the CLR directly supports is exceptions. Exceptions are a language feature that allow programmers to _throw_ an arbitrary object at the point that a failure occurs. When an object is thrown, the runtime searches the call stack for a method that declares that it can _catch_ the exception. If such a catch declaration is found, execution continues from that point. The usefulness of exceptions is that they avoid the very common mistake of not checking if a called method fails. Given that exceptions help avoid programmer mistakes (thus making programming easier), it is not surprising that the CLR supports them.
+
+As an aside, while exceptions avoid one common error (not checking for failure), they do not prevent another (restoring data structures to a consistent state in the event of a failure). This means that after an exception is caught, it is difficult in general to know if continuing execution will cause additional errors (caused by the first failure). This is an area where the CLR is likely to add value in the future. Even as currently implemented, however, exceptions are a great step forward (we just need to go further).
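+
+A minimal C# sketch of the throw/catch mechanism described above (the `Parse` helper is hypothetical):
+
+```csharp
+using System;
+
+class Program
+{
+    static int Parse(string s)
+    {
+        if (s == null)
+            throw new ArgumentNullException("s");   // throw an exception object
+        return int.Parse(s);
+    }
+
+    static void Main()
+    {
+        try
+        {
+            Parse(null);
+        }
+        catch (ArgumentNullException e)   // the runtime finds this handler on the call stack
+        {
+            Console.WriteLine("Caught: " + e.Message);
+        }
+    }
+}
+```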
+
+### Parameterized Types (Generics)
+
+Prior to version 2.0 of the CLR, the only parameterized type was the array. All other containers (such as hash tables, lists, and queues) operated on the generic Object type. The inability to create List<ElemT> or Dictionary<KeyT, ValueT> certainly had a negative performance effect because value types needed to be boxed on entry to a collection, and explicit casting was needed on element fetch. Nevertheless, that is not the overriding reason for adding parameterized types to the CLR. The main reason is that **parameterized types make programming easier**.
+
+The reason for this is subtle. The easiest way to see the effect is to imagine what a class library would look like if all types were replaced with a generic Object type. This effect is not unlike what happens in dynamically typed languages like JavaScript. In such a world, there are simply far more ways for a programmer to make incorrect (but type-safe) programs. Is the parameter for that method supposed to be a list? a string? an integer? any of the above? It is no longer obvious from looking at the method's signature. Worse, when a method returns an Object, what other methods can accept it as a parameter? Typical frameworks have hundreds of methods; if they all take parameters of type Object, it becomes very difficult to determine which Object instances are valid for the operations the method will perform. In short, strong typing helps programmers express their intent more clearly, and allows tools (e.g., the compiler) to enforce that intent. This results in a big productivity boost.
+
+These benefits do not disappear just because the type gets put into a List or a Dictionary, so clearly parameterized types have value. The only real question is whether parameterized types are best thought of as a language-specific feature which is "compiled out" by the time CIL is generated, or whether this feature should have first class support in the runtime. Either implementation is certainly possible. The CLR team chose first class support because without it, parameterized types would be implemented in different ways by different languages. This would imply that interoperability would be cumbersome at best. In addition, expressing programmer intent for parameterized types is most valuable _at the interface_ of a class library. If the CLR did not officially support parameterized types, then class libraries could not use them, and an important usability feature would be lost.
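+
+To make the contrast concrete, a short C# sketch comparing the pre-generics Object-based collections with parameterized types:
+
+```csharp
+using System.Collections;
+using System.Collections.Generic;
+
+class Program
+{
+    static void Main()
+    {
+        // Pre-generics: every element is an Object; the int is boxed on Add
+        // and requires an explicit cast (and unbox) on fetch.
+        ArrayList untyped = new ArrayList();
+        untyped.Add(42);
+        int a = (int)untyped[0];
+
+        // With parameterized types: no boxing, no cast, and the signature
+        // documents the programmer's intent.
+        List<int> typed = new List<int>();
+        typed.Add(42);
+        int b = typed[0];
+
+        System.Console.WriteLine(a + b);
+    }
+}
+```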
+
+### Programs as Data (Reflection APIs)
+
+The fundamentals of the CLR are garbage collection, type safety, and high-level language features. These basic characteristics forced the specification of the program (the CIL) to be fairly high level. Once this data existed at run time (something not true for C or C++ programs), it became obvious that it would also be valuable to expose this rich data to end programmers. This idea resulted in the creation of the System.Reflection interfaces (so-called because they allow the program to look at (reflect upon) itself). This interface allows you to explore almost all aspects of a program (what types it has, the inheritance relationship, and what methods and fields are present). In fact, so little information is lost that very good "decompilers" for managed code are possible (e.g., [.NET Reflector](http://www.red-gate.com/products/reflector/)). While those concerned with intellectual property protection are aghast at this capability (which can be fixed by purposefully destroying information through an operation called _obfuscating_ the program), the fact that it is possible is a testament to the richness of the information available at run time in managed code.
+
+In addition to simply inspecting programs at run time, it is also possible to perform operations on them (e.g., invoke methods, set fields, etc.), and perhaps most powerfully, to generate code from scratch at run time (System.Reflection.Emit). In fact, the runtime libraries use this capability to create specialized code for matching strings (System.Text.RegularExpressions), and to generate code for "serializing" objects to store in a file or send across the network. Capabilities like this were simply infeasible before (you would have to write a compiler!) but thanks to the runtime, are well within reach of many more programming problems.
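+
+A small C# sketch of the reflection APIs in action, both inspecting metadata and invoking a method found at run time:
+
+```csharp
+using System;
+using System.Reflection;
+
+class Program
+{
+    static void Main()
+    {
+        Type t = typeof(string);
+
+        // Inspect the program's own metadata.
+        Console.WriteLine(t.FullName + " derives from " + t.BaseType);
+        foreach (MethodInfo m in t.GetMethods())
+            Console.WriteLine(m.Name);
+
+        // Invoke a method discovered at run time.
+        MethodInfo contains = t.GetMethod("Contains", new Type[] { typeof(string) });
+        object result = contains.Invoke("hello", new object[] { "ell" });
+        Console.WriteLine(result);   // True
+    }
+}
+```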
+
+While reflection capabilities are indeed powerful, that power should be used with care. Reflection is usually significantly slower than its statically compiled counterparts. More importantly, self-referential systems are inherently harder to understand. This means that powerful features such as Reflection or Reflection.Emit should only be used when the value is clear and substantial.
+
+# Other Features
+
+The last grouping of runtime features are those that are not related to the fundamental architecture of the CLR (GC, type safety, high-level specification), but nevertheless fill important needs of any complete runtime system.
+
+## Interoperation with Unmanaged Code
+
+Managed code needs to be able to use functionality implemented in unmanaged code. There are two main "flavors" of interoperation. First is the ability simply to call unmanaged functions (this is called Platform Invoke or PINVOKE). Unmanaged code also has an object-oriented model of interoperation called COM (component object model) which has more structure than ad hoc method calls. Since both COM and the CLR have models for objects and other conventions (how errors are handled, lifetime of objects, etc.), the CLR can do a better job interoperating with COM code if it has special support.
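+
+As an illustration, a minimal C# sketch of Platform Invoke declaring and calling an unmanaged Win32 function (assuming a Windows system):
+
+```csharp
+using System.Runtime.InteropServices;
+
+class Program
+{
+    // Declare the unmanaged entry point; the runtime marshals the arguments.
+    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
+    static extern bool SetEnvironmentVariable(string name, string value);
+
+    static void Main()
+    {
+        SetEnvironmentVariable("MY_VARIABLE", "my value");
+    }
+}
+```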
+
+## Ahead of time Compilation
+
+In the CLR model, managed code is distributed as CIL, not native code. Translation to native code occurs at run time. As an optimization, the native code that is generated from the CIL can be saved in a file using a tool called crossgen (similar to .NET Framework NGEN tool). This avoids large amounts of compilation time at run time and is very important because the class library is so large.
+
+## Threading
+
+The CLR fully anticipated the need to support multi-threaded programs in managed code. From the start, the CLR libraries contained the System.Threading.Thread class which is a 1-to-1 wrapper over the operating system notion of a thread of execution. However, because it is just a wrapper over the operating system thread, creating a System.Threading.Thread is relatively expensive (it takes milliseconds to start). While this is fine for many operations, one style of programming creates very small work items (taking only tens of milliseconds). This is very common in server code (e.g., each task is serving just one web page) or in code that tries to take advantage of multi-processors (e.g., a multi-core sort algorithm). To support this, the CLR has the notion of a ThreadPool which allows WorkItems to be queued. In this scheme, the CLR is responsible for creating the necessary threads to do the work. While the CLR does expose the ThreadPool directly as the System.Threading.ThreadPool class, the preferred mechanism is to use the [Task Parallel Library](https://msdn.microsoft.com/en-us/library/dd460717(v=vs.110).aspx), which adds additional support for very common forms of concurrency control.
+
+From an implementation perspective, the important innovation of the ThreadPool is that it is responsible for ensuring that the optimal number of threads are used to dispatch the work. The CLR does this using a feedback system where it monitors the throughput rate and the number of threads and adjusts the number of threads to maximize the throughput. This is very nice because now programmers can think mostly in terms of "exposing parallelism" (that is, creating work items), rather than the more subtle question of determining the right amount of parallelism (which depends on the workload and the hardware on which the program is run).
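+
+A short C# sketch of exposing parallelism with the Task Parallel Library, leaving the thread-count decision to the runtime (the `Square` work item is a hypothetical stand-in):
+
+```csharp
+using System;
+using System.Threading.Tasks;
+
+class Program
+{
+    static int Square(int n) { return n * n; }   // a stand-in work item
+
+    static void Main()
+    {
+        // Queue two small work items; the runtime decides how many threads to use.
+        Task<int> t1 = Task.Run(() => Square(3));
+        Task<int> t2 = Task.Run(() => Square(4));
+        Console.WriteLine(t1.Result + t2.Result);   // prints 25
+    }
+}
+```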
+
+# Summary and Resources
+
+Phew! The runtime does a lot! It has taken many pages just to describe _some_ of the features of the runtime, without even starting to talk about internal details. The hope is, however, that this introduction will provide a useful framework for a deeper understanding of those internal details. The basic outline of this framework is:
+
+- The Runtime is a complete framework for supporting programming languages
+- The Runtime's goal is to make programming easy.
+- The Fundamental features of the runtime are:
+ - Garbage Collection
+ - Memory and Type Safety
+ - Support for High-Level Language Features
+
+## Useful Links
+
+- [MSDN Entry for the CLR][clr]
+- [Wikipedia Entry for the CLR](http://en.wikipedia.org/wiki/Common_Language_Runtime)
+- [ECMA Standard for the Common Language Infrastructure (CLI)][ecma-spec]
+- [.NET Framework Design Guidelines](http://msdn.microsoft.com/en-us/library/ms229042.aspx)
+- [CoreCLR Repo Documentation](README.md)
+
+[clr]: http://msdn.microsoft.com/library/8bs2ecf4.aspx
+[ecma-spec]: ../project-docs/dotnet-standards.md
+[cil-spec]: http://download.microsoft.com/download/7/3/3/733AD403-90B2-4064-A81E-01035A7FE13C/MS%20Partition%20III.pdf
+[fx-design-guidelines]: http://msdn.microsoft.com/en-us/library/ms229042.aspx
diff --git a/Documentation/botr/method-descriptor.md b/Documentation/botr/method-descriptor.md
new file mode 100644
index 0000000000..bce0bff340
--- /dev/null
+++ b/Documentation/botr/method-descriptor.md
@@ -0,0 +1,343 @@
+Method Descriptor
+=================
+
+Author: Jan Kotas ([@jkotas](https://github.com/jkotas)) - 2006
+
+Introduction
+============
+
+MethodDesc (method descriptor) is the internal representation of a managed method. It serves several purposes:
+
+- Provides a unique method handle, usable throughout the runtime. For normal methods, the MethodDesc is a unique handle for a <module, metadata token, instantiation> triplet.
+- Caches frequently used information that is expensive to compute from metadata (e.g. whether the method is static).
+- Captures the runtime state of the method (e.g. whether the code has been generated for the method already).
+- Owns the entry point of the method.
+
+Design Goals and Non-goals
+--------------------------
+
+### Goals
+
+**Performance:** The design of MethodDesc is heavily optimized for size, since there is one of them for every method. For example, the MethodDesc for a normal non-generic method is 8 bytes in the current design.
+
+### Non-goals
+
+**Richness:** The MethodDesc does not cache all information about the method. It is expected that the underlying metadata has to be accessed for less frequently used information (e.g. method signature).
+
+Design of MethodDesc
+====================
+
+Kinds of MethodDescs
+--------------------
+
+There are multiple kinds of MethodDescs:
+
+**IL**
+
+Used for regular IL methods.
+
+**Instantiated**
+
+Used for less common IL methods that have a generic instantiation or that do not have a preallocated slot in the method table.
+
+**FCall**
+
+Internal methods implemented in unmanaged code. These are [methods marked with the MethodImplAttribute(MethodImplOptions.InternalCall) attribute](mscorlib.md), delegate constructors and tlbimp constructors.
+
+**NDirect**
+
+P/Invoke methods. These are methods marked with the DllImport attribute.
+
+**EEImpl**
+
+Delegate methods whose implementation is provided by the runtime (Invoke, BeginInvoke, EndInvoke). See [ECMA 335 Partition II - Delegates](../project-docs/dotnet-standards.md).
+
+**Array**
+
+Array methods whose implementation is provided by the runtime (Get, Set, Address). See [ECMA Partition II – Arrays](../project-docs/dotnet-standards.md).
+
+**ComInterop**
+
+COM interface methods. Since the non-generic interfaces can be used for COM interop by default, this kind is usually used for all interface methods.
+
+**Dynamic**
+
+Dynamically created methods without underlying metadata. Produced by Stub-as-IL and LCG (lightweight code generation).
+
+Alternative Implementations
+---------------------------
+
+Virtual methods and inheritance would be the natural way to implement the various kinds of MethodDesc in C++. The virtual methods would add a vtable pointer to each MethodDesc, wasting a lot of precious space. The vtable pointer occupies 4 bytes on x86. Instead, the virtualization is implemented by switching based on the MethodDesc kind, which fits into 3 bits. For example:
+
+```c++
+DWORD MethodDesc::GetAttrs()
+{
+ if (IsArray())
+ return ((ArrayMethodDesc*)this)->GetAttrs();
+
+ if (IsDynamic())
+ return ((DynamicMethodDesc*)this)->GetAttrs();
+
+ return GetMDImport()->GetMethodDefProps(GetMemberDef());
+}
+```
+
+Method Slots
+------------
+
+Each MethodDesc has a slot, which contains the entry point of the method. The slot and entry point must exist for all methods, even the ones that never run like abstract methods. There are multiple places in the runtime that depend on the 1:1 mapping between entry points and MethodDescs, making this relationship an invariant.
+
+The slot is either in MethodTable or in MethodDesc itself. The location of the slot is determined by `mdcHasNonVtableSlot` bit on MethodDesc.
+
+The slot is stored in MethodTable for methods that require efficient lookup via slot index, e.g. virtual methods or methods on generic types. The MethodDesc contains the slot index to allow fast lookup of the entry point in this case.
+
+Otherwise, the slot is part of the MethodDesc itself. This arrangement improves data locality and saves working set. Also, it is not even always possible to preallocate a slot in a MethodTable upfront for dynamically created MethodDescs, such as for methods added by Edit & Continue, instantiations of generic methods or [dynamic methods](https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/Reflection/Emit/DynamicMethod.cs).
+
+MethodDesc Chunks
+-----------------
+
+MethodDescs are allocated in chunks to save space. Multiple MethodDescs tend to share the same MethodTable and the upper bits of the metadata token. A MethodDescChunk is formed by hoisting this common information in front of an array of multiple MethodDescs. Each MethodDesc then contains just its index in the array.
+
+![Figure 1](../images/methoddesc-fig1.png)
+
+Figure 1 MethodDescChunk and MethodTable
+
+Debugging
+---------
+
+The following SOS commands are useful for debugging MethodDesc:
+
+- **DumpMD** – dump the MethodDesc content:
+
+ !DumpMD 00912fd8
+ Method Name: My.Main()
+ Class: 009111ec
+ MethodTable: 00912fe8
+ Token: 06000001
+ Module: 00912c14
+ IsJitted: yes
+ CodeAddr: 00ca0070
+
+- **IP2MD** – find MethodDesc for given code address:
+
+ !ip2md 00ca007c
+ MethodDesc: 00912fd8
+ Method Name: My.Main()
+ Class: 009111ec
+ MethodTable: 00912fe8
+ Token: 06000001
+ Module: 00912c14
+ IsJitted: yes
+ CodeAddr: 00ca0070
+
+- **Name2EE** – find MethodDesc for given method name:
+
+ !name2ee hello.exe My.Main
+ Module: 00912c14 (hello.exe)
+ Token: 0x06000001
+ MethodDesc: 00912fd8
+ Name: My.Main()
+ JITTED Code Address: 00ca0070
+
+- **Token2EE** – find MethodDesc for given token (useful for finding MethodDesc for methods with weird names):
+
+ !token2ee hello.exe 0x06000001
+ Module: 00912c14 (hello.exe)
+ Token: 0x06000001
+ MethodDesc: 00912fd8
+ Name: My.Main()
+ JITTED Code Address: 00ca0070
+
+- **DumpMT -MD** – dump all MethodDescs in the given MethodTable:
+
+ !DumpMT -MD 0x00912fe8
+ ...
+ MethodDesc Table
+ Entry MethodDesc JIT Name
+ 79354bec 7913bd48 PreJIT System.Object.ToString()
+ 793539c0 7913bd50 PreJIT System.Object.Equals(System.Object)
+ 793539b0 7913bd68 PreJIT System.Object.GetHashCode()
+ 7934a4c0 7913bd70 PreJIT System.Object.Finalize()
+ 00ca0070 00912fd8 JIT My.Main()
+ 0091303c 00912fe0 NONE My..ctor()
+
+A MethodDesc has fields with the name and signature of the method on debug builds. This is useful for debugging when the runtime state is severely corrupted and the SOS extension does not work.
+
+Precode
+=======
+
+The precode is a small fragment of code used to implement temporary entry points and an efficient wrapper for stubs. Precode is a niche code-generator for these two cases, generating the most efficient code possible. In an ideal world, all native code dynamically generated by the runtime would be produced by the JIT. That's not feasible in this case, given the specific requirements of these two scenarios. The basic precode on x86 may look like this:
+
+ mov eax,pMethodDesc // Load MethodDesc into scratch register
+ jmp target // Jump to a target
+
+**Efficient Stub wrappers:** The implementation of certain methods (e.g. P/Invoke, delegate invocation, multi-dimensional array setters and getters) is provided by the runtime, typically as hand-written assembly stubs. Precode provides a space-efficient wrapper over stubs, to multiplex them for multiple callers.
+
+The worker code of the stub is wrapped by a precode fragment that can be mapped to the MethodDesc and that jumps to the worker code of the stub. The worker code of the stub can be shared between multiple methods this way. It is an important optimization used to implement P/Invoke marshalling stubs. It also creates a 1:1 mapping between MethodDescs and entry points, which establishes a simple and efficient low-level system.
+
+**Temporary entry points:** Methods must provide entry points before they are jitted so that jitted code has an address to call them. These temporary entry points are provided by precode. They are a specific form of stub wrappers.
+
+This technique is a lazy approach to jitting, which provides a performance optimization in both space and time. Otherwise, the transitive closure of a method would need to be jitted before it was executed. This would be a waste, since only the dependencies of taken code branches (e.g. if statement) require jitting.
+
+Each temporary entry point is much smaller than a typical method body. They need to be small since there are a lot of them, even at the cost of performance. The temporary entry points are executed just once before the actual code for the method is generated.
+
+The target of the temporary entry point is a PreStub, which is a special kind of stub that triggers jitting of a method. It atomically replaces the temporary entry point with a stable entry point. The stable entry point has to remain constant for the method lifetime. This invariant is required to guarantee thread safety since the method slot is always accessed without any locks taken.
+
+The **stable entry point** is either the native code or the precode. The **native code** is either jitted code or code saved in NGen image. It is common to talk about jitted code when we actually mean native code.
+
+Temporary entry points are never saved into NGen images. All entry points in NGen images are stable entry points that are never changed. This is an important optimization that reduces private working set.
+
+![Figure 2](../images/methoddesc-fig2.png)
+
+Figure 2 Entry Point State Diagram
+
+A method can have both native code and precode if there is a need to do work before the actual method body is executed. This situation typically happens for NGen image fixups. The native code address is stored in an optional MethodDesc slot in this case, so that the native code of the method can be looked up in a cheap, uniform way.
+
+![Figure 3](../images/methoddesc-fig3.png)
+
+Figure 3 The most complex case of Precode, Stub and Native Code
+
+Single Callable vs. Multi Callable entry points
+-----------------------------------------------
+
+An entry point is needed to call a method. The MethodDesc exposes methods that encapsulate the logic to get the most efficient entry point for the given situation. The key difference is whether the entry point will be used to call the method just once or whether it will be used to call the method multiple times.
+
+For example, it may be a bad idea to use the temporary entry point to call the method multiple times since it would go through the PreStub each time. On the other hand, using the temporary entry point to call the method just once should be fine.
+
+The methods to get callable entry points from MethodDesc are:
+
+- MethodDesc::GetSingleCallableAddrOfCode
+- MethodDesc::GetMultiCallableAddrOfCode
+- MethodDesc::GetSingleCallableAddrOfVirtualizedCode
+- MethodDesc::GetMultiCallableAddrOfVirtualizedCode
+
+Types of precode
+----------------
+
+There are multiple specialized types of precodes.
+
+The type of precode has to be cheaply computable from the instruction sequence. On x86 and x64, the type of precode is computed by fetching a byte at a constant offset. Of course, this imposes limits on the instruction sequences used to implement the various precode types.
+
+**StubPrecode**
+
+StubPrecode is the basic precode type. It loads the MethodDesc into a scratch register and then jumps. It must be implemented for precodes to work. It is used as a fallback when no other specialized precode type is available.
+
+All other precode types are optional optimizations that the platform-specific files turn on via HAS\_XXX\_PRECODE defines.
+
+StubPrecode looks like this on x86:
+
+ mov eax,pMethodDesc
+ mov ebp,ebp // dummy instruction that marks the type of the precode
+ jmp target
+
+"target" points to prestub initially. It is patched to point to the final target. The final target (stub or native code) may or may not use MethodDesc in eax. Stubs often use it, native code does not use it.
+
+**FixupPrecode**
+
+FixupPrecode is used when the final target does not require the MethodDesc in a scratch register<sup>2</sup>. The FixupPrecode saves a few cycles by avoiding the load of the MethodDesc into the scratch register.
+
+The most common usage of FixupPrecode is for method fixups in NGen images.
+
+The initial state of the FixupPrecode on x86:
+
+ call PrecodeFixupThunk // This call never returns. It pops the return address
+ // and uses it to fetch the pMethodDesc below to find
+ // which method needs to be jitted
+ pop esi // dummy instruction that marks the type of the precode
+ dword pMethodDesc
+
+Once it has been patched to point to final target:
+
+ jmp target
+ pop edi
+ dword pMethodDesc
+
+<sup>2</sup> Passing MethodDesc in scratch register is sometimes referred to as **MethodDesc Calling Convention**.
+
+**FixupPrecode chunks**
+
+A FixupPrecode chunk is a space-efficient representation of multiple FixupPrecodes. It mirrors the idea of the MethodDescChunk by hoisting the similar MethodDesc pointers from multiple FixupPrecodes to a shared area.
+
+The FixupPrecode chunk saves space and improves code density of the precodes. The code density improvement from FixupPrecode chunks resulted in 1% - 2% gain in big server scenarios on x64.
+
+A FixupPrecode chunk looks like this on x86:
+
+ jmp Target2
+ pop edi // dummy instruction that marks the type of the precode
+ db MethodDescChunkIndex
+ db 2 (PrecodeChunkIndex)
+
+ jmp Target1
+ pop edi
+ db MethodDescChunkIndex
+ db 1 (PrecodeChunkIndex)
+
+ jmp Target0
+ pop edi
+ db MethodDescChunkIndex
+ db 0 (PrecodeChunkIndex)
+
+ dw pMethodDescBase
+
+One FixupPrecode chunk corresponds to one MethodDescChunk. There is no 1:1 mapping between the FixupPrecodes in the chunk and the MethodDescs in the MethodDescChunk though. Each FixupPrecode has the index of the method it belongs to, which allows allocating FixupPrecodes in the chunk only for the methods that need them.
+
+**Compact entry points**
+
+The compact entry point is a space-efficient implementation of temporary entry points.
+
+Temporary entry points implemented using StubPrecode or FixupPrecode can be patched to point to the actual code. Jitted code can call a temporary entry point directly. The temporary entry point can be a multi-callable entry point in this case.
+
+Compact entry points cannot be patched to point to the actual code. Jitted code cannot call them directly. They are trading off speed for size. Calls to these entry points are indirected via slots in a table (FuncPtrStubs) that are patched to point to the actual entry point eventually. A request for a multicallable entry point allocates a StubPrecode or FixupPrecode on demand in this case.
+
+The raw speed difference is the cost of an indirect call for a compact entry point vs. the cost of one direct call and one direct jump on the given platform. The latter used to be faster by a few percent in large server scenarios since it could be predicted better by the hardware (2005). That is not always the case on current (2015) hardware.
+
+The compact entry points have been historically implemented on x86 only. Their additional complexity, space vs. speed trade-off and hardware advancements made them unjustified on other platforms.
+
+The compact entry point on x86 looks like this:
+
+ entrypoint0:
+ mov al,0
+ jmp short Dispatch
+
+ entrypoint1:
+ mov al,1
+ jmp short Dispatch
+
+ entrypoint2:
+ mov al,2
+ jmp short Dispatch
+
+ Dispatch:
+ movzx eax,al
+ shl eax, 3
+ add eax, pBaseMD
+ jmp PreStub
+
+The allocation of temporary entry points always tries to pick the smallest temporary entry point from the available choices. For example, a single compact entry point is bigger than a single StubPrecode on x86, so the StubPrecode will be preferred over the compact entry point in that case. The allocation of the precode for a stable entry point will try to reuse an allocated temporary entry point precode if one of the matching type exists.
+
+**ThisPtrRetBufPrecode**
+
+ThisPtrRetBufPrecode is used to swap the return buffer and the this pointer for open instance delegates returning value types. It is used to convert the calling convention of MyValueType Bar(Foo x) to the calling convention of MyValueType Foo::Bar().
+
+This precode is always allocated on demand as a wrapper of the actual method entry point and stored in a table (FuncPtrStubs).
+
+ThisPtrRetBufPrecode looks like this:
+
+ mov eax,ecx
+ mov ecx,edx
+ mov edx,eax
+ nop
+ jmp entrypoint
+ dw pMethodDesc
+
+**NDirectImportPrecode**
+
+NDirectImportPrecode is used for lazy binding of unmanaged P/Invoke targets. This precode exists for convenience and to reduce the amount of platform-specific plumbing.
+
+Each NDirectMethodDesc has an NDirectImportPrecode in addition to the regular precode.
+
+NDirectImportPrecode looks like this on x86:
+
+ mov eax,pMethodDesc
+ mov eax,eax // dummy instruction that marks the type of the precode
+ jmp NDirectImportThunk // loads P/Invoke target for pMethodDesc lazily
diff --git a/Documentation/botr/mscorlib.md b/Documentation/botr/mscorlib.md
new file mode 100644
index 0000000000..5b5046e5bc
--- /dev/null
+++ b/Documentation/botr/mscorlib.md
@@ -0,0 +1,357 @@
+Mscorlib and Calling Into the Runtime
+===
+
+Author: Brian Grunkemeyer ([@briangru](https://github.com/briangru)) - 2006
+
+# Introduction
+
+Mscorlib is the assembly for defining the core parts of the type system, and a good portion of the Base Class Library. Base data types live in this assembly, and it has a tight coupling with the CLR. Here you will learn exactly how & why mscorlib.dll is special, and the basics about calling into the CLR from managed code via QCall and FCall methods. It also discusses calling from within the CLR into managed code.
+
+## Dependencies
+
+Since mscorlib defines base data types like Object, Int32, and String, mscorlib cannot depend on other managed assemblies. However, there is a strong dependency between mscorlib and the CLR. Many of the types in mscorlib need to be accessed from native code, so the layout of many managed types is defined both in managed code and in native code inside the CLR. Additionally, some fields may be defined only in debug or checked builds, so typically mscorlib must be compiled separately for checked vs. retail builds.
+
+For 64 bit platforms, some constants are also defined at compile time. So a 64 bit mscorlib.dll is slightly different from a 32 bit mscorlib.dll. Due to these constants, such as IntPtr.Size, most libraries above mscorlib should not need to build separately for 32 bit vs. 64 bit.
+
+## What Makes Mscorlib Special?
+
+Mscorlib has several unique properties, many of which are due to its tight coupling to the CLR.
+
+- Mscorlib defines the core types necessary to implement the CLR's Virtual Object System, such as the base data types (Object, Int32, String, etc).
+- The CLR must load mscorlib on startup to load certain system types.
+- Only one mscorlib can be loaded in the process at a time, due to layout issues. Loading multiple mscorlibs would require formalizing a contract of behavior, FCall methods, and datatype layout between the CLR & mscorlib, and keeping that contract relatively stable across versions.
+- Mscorlib's types will be used heavily for native interop, and managed exceptions should map correctly to native error codes/formats.
+- The CLR's multiple JIT compilers may special case a small group of certain methods in mscorlib for performance reasons, both in terms of optimizing away the method (such as Math.Cos(double)), or calling a method in peculiar ways (such as Array.Length, or some implementation details on StringBuilder for getting the current thread).
+- Mscorlib will need to call into native code, via P/Invoke where appropriate, primarily into the underlying operating system or occasionally a platform adaptation layer.
+- Mscorlib will require calling into the CLR to expose some CLR-specific functionality, such as triggering a garbage collection, to load classes, or to interact with the type system in a non-trivial way. This requires a bridge between managed code and native, "manually managed" code within the CLR.
+- The CLR will need to call into managed code to call managed methods, and to get at certain functionality that is only implemented in managed code.
+
+# Interface between managed & CLR code
+
+To reiterate, the needs of managed code in mscorlib include:
+
+- The ability to access fields of some managed data structures in both managed code and "manually managed" code within the CLR
+- Managed code must be able to call into the CLR
+- The CLR must be able to call managed code.
+
+To implement these, we need a way for the CLR to specify and optionally verify the layout of a managed object in native code, a managed mechanism for calling into native code, and a native mechanism for calling into managed code.
+
+The managed mechanism for calling into native code must also support the special managed calling convention used by String's constructors, where the constructor allocates the memory used by the object (instead of the typical convention where the constructor is called after the GC allocates memory).
+
+The CLR provides a [mscorlib binder](https://github.com/dotnet/coreclr/blob/master/src/vm/binder.cpp) internally, providing a mapping between unmanaged types and fields to managed types & fields. The binder will look up & load classes and allows you to call managed methods. It also does some simple verification to ensure the correctness of any layout information specified in both managed & native code. The binder ensures that the managed class you're attempting to use exists in mscorlib, has been loaded, and that the field offsets are correct. It also needs the ability to differentiate between method overloads with different signatures.
+
+# Calling from managed to native code
+
+We have two techniques for calling into the CLR from managed code. FCall allows you to call directly into the CLR code, and provides a lot of flexibility in terms of manipulating objects, though it is easy to cause GC holes by not tracking object references correctly. QCall allows you to call into the CLR via P/Invoke, and is much harder to accidentally misuse than FCall. FCalls are identified in managed code as extern methods with the MethodImplOptions.InternalCall bit set. QCalls are _static_ extern methods that look like regular P/Invokes, but to a library called "QCall".
+
+There is a small variant of FCall called HCall (for Helper call) for implementing JIT helpers, for doing things like accessing multi-dimensional array elements, range checks, etc. The only difference between HCall and FCall is that HCall methods won't show up in an exception stack trace.
+
+### Choosing between FCall, QCall, P/Invoke, and writing in managed code
+
+First, remember that you should be writing as much as possible in managed code. You avoid a raft of potential GC hole issues, you get a good debugging experience, and the code is often simpler. It also is preparation for ongoing refactoring of mscorlib into smaller layered fully managed libraries in [corefx](https://github.com/dotnet/corefx/).
+
+Reasons to write FCalls in the past generally fell into three camps: missing language features, better performance, or implementing unique interactions with the runtime. C# now has almost every useful language feature that you could get from C++, including unsafe code & stack-allocated buffers, and this eliminates the first two reasons for FCalls. We have ported some parts of the CLR that were heavily reliant on FCalls to managed code in the past (such as Reflection and some Encoding & String operations), and we want to continue this momentum. We may port our number formatting & String comparison code to managed in the future.
+
+If the only reason you're defining an FCall method is to call a native Win32 method, you should be using P/Invoke to call Win32 directly. P/Invoke is the public native method interface, and should be doing everything you need in a correct manner.
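+
+For example, a sketch of what such a direct P/Invoke declaration might look like (the `Win32Native` wrapper class here is illustrative, not a prescribed mscorlib type):
+
+    using System.Runtime.InteropServices;
+
+    internal static class Win32Native
+    {
+        // Call the OS directly via P/Invoke instead of routing through an FCall.
+        [DllImport("kernel32.dll")]
+        internal static extern ulong GetTickCount64();
+    }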
+
+If you still need to implement a feature inside the runtime, now consider if there is a way to reduce the frequency of transitioning to native code. Can you write the common case in managed, and only call into native for some rare corner cases? You're usually best off keeping as much as possible in managed code.
+
+QCalls are the preferred mechanism going forward. You should only use FCalls when you are "forced" to. This happens when there is a common "short path" through the code that is important to optimize. This short path should not be more than a few hundred instructions, cannot allocate GC memory, and cannot take locks or throw exceptions (GC_NOTRIGGER, NOTHROWS). In all other circumstances (and especially when you enter an FCall and then simply erect a HelperMethodFrame), you should be using QCall.
+
+FCalls were specifically designed for short paths of code that must be optimized. They allowed you to take explicit control over when a frame was erected. However, this approach is error prone and is not worth it for many APIs. QCalls are essentially P/Invokes into the CLR.
+
+As a result, QCalls give you some advantageous marshaling for SafeHandles automatically – your native method just takes a HANDLE type, and can use it without worrying whether someone will free the handle while you are in that method body. The resulting FCall method would need to use a SafeHandleHolder, and may need to protect the SafeHandle, etc. Leveraging the P/Invoke marshaler can avoid this additional plumbing code.
+
+## QCall Functional Behavior
+
+QCalls are very much like a normal P/Invoke from mscorlib.dll to the CLR. Unlike FCalls, QCalls marshal all arguments as unmanaged types like a normal P/Invoke. QCalls also switch to preemptive GC mode like a normal P/Invoke. These two features should make QCalls easier to write reliably compared to FCalls. QCalls are not prone to the GC holes and GC starvation bugs that are common with FCalls.
+
+QCalls perform better than FCalls that erect a HelperMethodFrame. The overhead of a QCall is about 1.4x less than the overhead of an FCall with a HelperMethodFrame on x86 and x64.
+
+The preferred types for QCall arguments are primitive types that are efficiently handled by the P/Invoke marshaler (INT32, LPCWSTR, BOOL). Notice that BOOL is the correct boolean flavor for QCall arguments. On the other hand, CLR_BOOL is the correct boolean flavor for FCall arguments.
+
+The pointers to common unmanaged EE structures should be wrapped into handle types. This is to make the managed implementation type safe and avoid falling into unsafe C# everywhere. See AssemblyHandle in [vm\qcall.h][qcall] for an example.
+
+[qcall]: https://github.com/dotnet/coreclr/blob/master/src/vm/qcall.h
+
+There is a way to pass raw object references in and out of QCalls. It is done by wrapping a pointer to a local variable in a handle. It is intentionally cumbersome and should be avoided if reasonably possible. See the StringHandleOnStack in the example below. Returning objects, especially strings, from QCalls is the only common pattern where passing the raw objects is widely acceptable. (For reasoning on why this set of restrictions helps make QCalls less prone to GC holes, read the "GC Holes, FCall, and QCall" section below.)
+
+### QCall Example - Managed Part
+
+Do not replicate the comments into your actual QCall implementation. This is for illustrative purposes.
+
+ class Foo
+ {
+ // All QCalls should have the following DllImport and
+ // SuppressUnmanagedCodeSecurity attributes
+ [DllImport(JitHelpers.QCall, CharSet = CharSet.Unicode)]
+ [SuppressUnmanagedCodeSecurity]
+ // QCalls should always be static extern.
+ private static extern bool Bar(int flags, string inString, StringHandleOnStack retString);
+
+ // Many QCalls have a thin managed wrapper around them to expose them to
+ // the world in more meaningful way.
+ public string Bar(int flags)
+ {
+ string retString = null;
+
+ // The strings are returned from QCalls by taking address
+ // of a local variable using JitHelpers.GetStringHandle method
+ if (!Bar(flags, this.Id, JitHelpers.GetStringHandle(ref retString)))
+ FatalError();
+
+ return retString;
+ }
+ }
+
+### QCall Example - Unmanaged Part
+
+Do not replicate the comments into your actual QCall implementation.
+
+The QCall entrypoint has to be registered in tables in [vm\ecalllist.h][ecalllist] using QCFuncEntry macro. See "Registering your QCall or FCall Method" below.
+
+[ecalllist]: https://github.com/dotnet/coreclr/blob/master/src/vm/ecalllist.h
+
+ class FooNative
+ {
+ public:
+ // All QCalls should be static and should be tagged with QCALLTYPE
+ static
+ BOOL QCALLTYPE Bar(int flags, LPCWSTR wszString, QCall::StringHandleOnStack retString);
+ };
+
+ BOOL QCALLTYPE FooNative::Bar(int flags, LPCWSTR wszString, QCall::StringHandleOnStack retString)
+ {
+ // All QCalls should have QCALL_CONTRACT.
+ // It is alias for THROWS; GC_TRIGGERS; MODE_PREEMPTIVE; SO_TOLERANT.
+ QCALL_CONTRACT;
+
+ // Optionally, use QCALL_CHECK instead and the expanded form of the contract
+ // if you want to specify preconditions:
+ // CONTRACTL {
+ // QCALL_CHECK;
+ // PRECONDITION(wszString != NULL);
+ // } CONTRACTL_END;
+
+ // The only line between QCALL_CONTRACT and BEGIN_QCALL
+ // should be the return value declaration if there is one.
+ BOOL retVal = FALSE;
+
+ // The body has to be enclosed in BEGIN_QCALL/END_QCALL macro. It is necessary
+ // to make the exception handling work.
+ BEGIN_QCALL;
+
+ // Validate arguments if necessary and throw exceptions.
+ // There is no convention currently on whether the argument validation should be
+ // done in managed or unmanaged code.
+ if (flags != 0)
+ COMPlusThrow(kArgumentException, L"InvalidFlags");
+
+ // No need to worry about GC moving strings passed into QCall.
+ // Marshalling pins them for us.
+ printf("%S", wszString);
+
+ // This is the most efficient way to return strings back
+ // to managed code. No need to use StringBuilder.
+ retString.Set(L"Hello");
+
+ // You cannot return from inside a BEGIN_QCALL/END_QCALL block.
+ // The return value has to be passed out in a helper variable.
+ retVal = TRUE;
+
+ END_QCALL;
+
+ return retVal;
+ }
+
+## FCall Functional Behavior
+
+FCalls allow more flexibility in terms of passing object references around, with a higher code complexity and more opportunities to hang yourself. Additionally, FCall methods must either erect a helper method frame along their common code paths, or for any FCall of non-trivial length, explicitly poll for whether a garbage collection must occur. Failing to do so will lead to starvation issues if managed code repeatedly calls the FCall method in a tight loop, because FCalls execute while the thread only allows the GC to run in a cooperative manner.
+
+FCalls require a lot of glue, too much to describe here. Look at [fcall.h][fcall] for details.
+
+[fcall]: https://github.com/dotnet/coreclr/blob/master/src/vm/fcall.h
+
+### GC Holes, FCall, and QCall
+
+A much more complete discussion on GC holes can be found in the [CLR Code Guide](../coding-guidelines/clr-code-guide.md). Look for ["Is your code GC-safe?"](../coding-guidelines/clr-code-guide.md#is-your-code-gc-safe). This tailored discussion motivates some of the reasons why FCall and QCall have some of their strange conventions.
+
+Object references passed as parameters to FCall methods are not GC-protected, meaning that if a GC occurs, those references will point to the old location in memory of an object, not the new location. For this reason, FCalls usually follow the discipline of accepting something like "StringObject*" as their parameter type, then explicitly converting that to a STRINGREF before doing operations that may trigger a GC. You must GC protect object references before triggering a GC, if you expect to be able to use that object reference later.
+
+All GC heap allocations within an FCall method must happen within a helper method frame. If you allocate memory on the GC's heap, the GC may collect dead objects & move objects around in unpredictable ways, with some low probability. For this reason, you must manually report any object references in your method to the GC, so that if a garbage collection occurs, your object reference will be updated to refer to the new location in memory. Any pointers into managed objects (like arrays or Strings) within your code will not be updated automatically, and must be re-fetched after any operation that may allocate memory and before your first usage. Reporting a reference can be done via the GCPROTECT macros, or as parameters when you erect a helper method frame.
+
+Failing to properly report an OBJECTREF or to update an interior pointer is commonly referred to as a "GC hole", because the OBJECTREF class will do some validation that it points to a valid object every time you dereference it in checked builds. When an OBJECTREF pointing to an invalid object is dereferenced, you'll get an assert saying something like "Detected an invalid object reference. Possible GC hole?". This assert is unfortunately easy to hit when writing "manually managed" code.
+
+Note that QCall's programming model is restrictive to sidestep GC holes most of the time, by forcing you to pass in the address of an object reference on the stack. This guarantees that the object reference is GC protected by the JIT's reporting logic, and that the actual object reference will not move because it is not allocated in the GC heap. QCall is our recommended approach, precisely because it makes GC holes harder to write.
+
+### FCall Epilogue Walker for x86
+
+The managed stack walker needs to be able to unwind from FCalls. This is relatively easy on newer platforms that define conventions for stack unwinding as part of the ABI. The stack unwinding conventions are not defined by the ABI for x86. The runtime works around this by implementing an epilog walker. The epilog walker computes the FCall return address and callee-saved registers by simulating the FCall execution. This imposes limits on which constructs are allowed in the FCall implementation.
+
+Complex constructs like stack allocated objects with destructors or exception handling in the FCall implementation may confuse the epilog walker. It leads to GC holes or crashes during stack walking. There is no exact list of what constructs should be avoided to prevent this class of bugs. An FCall implementation that is fine one day may break with the next C++ compiler update. We depend on stress runs & code coverage to find bugs in this area.
+
+Setting a breakpoint inside an FCall implementation may confuse the epilog walker. It leads to an "Invalid breakpoint in a helpermethod frame epilog" assert inside [vm\i386\gmsx86.cpp](https://github.com/dotnet/coreclr/blob/master/src/vm/i386/gmsx86.cpp).
+
+### FCall Example – Managed Part
+
+Here's a real-world example from the String class:
+
+ public sealed partial class String
+ {
+ // Replaces all instances of oldChar with newChar.
+ [MethodImplAttribute(MethodImplOptions.InternalCall)]
+ public extern String Replace (char oldChar, char newChar);
+ }
+
+### FCall Example – Native Part
+
+The FCall entrypoint has to be registered in tables in [vm\ecalllist.h][ecalllist] using FCFuncEntry macro. See "Registering your QCall or FCall Method".
+
+Notice how oldBuffer and newBuffer (interior pointers into String instances) are re-fetched after allocating memory. Also, this method is an instance method in managed code, with the "this" parameter passed as the first argument. We use StringObject* as the argument type, then copy it into a STRINGREF so we get some error checking when we use it.
+
+ FCIMPL3(LPVOID, COMString::Replace, StringObject* thisRefUNSAFE, CLR_CHAR oldChar, CLR_CHAR newChar)
+ {
+ FCALL_CONTRACT;
+
+ int length = 0;
+ int firstFoundIndex = -1;
+ WCHAR *oldBuffer = NULL;
+ WCHAR *newBuffer;
+
+ STRINGREF newString = NULL;
+ STRINGREF thisRef = (STRINGREF)thisRefUNSAFE;
+
+ if (thisRef==NULL) {
+ FCThrowRes(kNullReferenceException, L"NullReference_This");
+ }
+
+ [... Removed some uninteresting code here for illustrative purposes...]
+
+ HELPER_METHOD_FRAME_BEGIN_RET_ATTRIB_2(Frame::FRAME_ATTR_RETURNOBJ, newString, thisRef);
+
+ //Get the length and allocate a new String
+ //We will definitely do an allocation here.
+ newString = NewString(length);
+
+ //After allocation, thisRef may have moved
+ oldBuffer = thisRef->GetBuffer();
+
+ //Get the buffers in both of the Strings.
+ newBuffer = newString->GetBuffer();
+
+ //Copy the characters, doing the replacement as we go.
+ for (int i=0; i<firstFoundIndex; i++) {
+ newBuffer[i]=oldBuffer[i];
+ }
+ for (int i=firstFoundIndex; i<length; i++) {
+ newBuffer[i]=(oldBuffer[i]==((WCHAR)oldChar))?
+ ((WCHAR)newChar):oldBuffer[i];
+ }
+
+ HELPER_METHOD_FRAME_END();
+
+ return OBJECTREFToObject(newString);
+ }
+ FCIMPLEND
+
+
+## Registering your QCall or FCall Method
+
+The CLR must know the name of your QCall and FCall methods, both in terms of the managed class & method names, as well as which native methods to call. That is done in [ecalllist.h][ecalllist], with two arrays. The first array maps namespace & class names to an array of function elements. That array of function elements then maps individual method names & signatures to function pointers.
+
+Say we defined an FCall method for String.Replace(char, char), in the example above. First, we need to ensure that we have an array of function elements for the String class.
+
+ // Note these have to remain sorted by name:namespace pair (Assert will wack you if you don't)
+ ...
+ FCClassElement("String", "System", gStringFuncs)
+ ...
+
+Second, we must then ensure that gStringFuncs contains a proper entry for Replace. Note that if a method name has multiple overloads (such as String.Replace(String, String)), then we can specify a signature:
+
+ FCFuncStart(gStringFuncs)
+ ...
+ FCFuncElement("IndexOf", COMString::IndexOfChar)
+ FCFuncElementSig("Replace", &gsig_IM_Char_Char_RetStr, COMString::Replace)
+ FCFuncElementSig("Replace", &gsig_IM_Str_Str_RetStr, COMString::ReplaceString)
+ ...
+ FCFuncEnd()
+
+There is a parallel QCFuncElement macro.
+
+## Naming convention
+
+Try to use a normal name (e.g. no "_", "n" or "native" prefix) for all FCalls and QCalls. It is not a good idea to encode the fact that a function is implemented in the VM in its name, for the following reasons:
+
+- There are directly exposed public FCalls. These FCalls have to follow the naming convention for public APIs.
+- The implementations of functions do move between the CLR and mscorlib.dll. It is painful to change the function name at all call sites when this happens.
+
+When necessary, you can use an "Internal" prefix to disambiguate the name of the FCall or QCall from the public entry point (e.g. the public entry point does error checking and then calls a shared worker function with exactly the same signature). This is no different from how you would deal with this situation in pure managed code in the BCL.
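+
+For instance, a hypothetical sketch of the "Internal" prefix pattern (all names invented for illustration):
+
+    using System;
+    using System.Runtime.CompilerServices;
+
+    public static class Comparer
+    {
+        // Public entry point: validate arguments, then call the worker.
+        public static int Compare(string strA, string strB)
+        {
+            if (strA == null || strB == null)
+                throw new ArgumentNullException(strA == null ? "strA" : "strB");
+            return InternalCompare(strA, strB);
+        }
+
+        // FCall worker with exactly the same signature, disambiguated by the prefix.
+        [MethodImplAttribute(MethodImplOptions.InternalCall)]
+        private static extern int InternalCompare(string strA, string strB);
+    }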
+
+# Types with a Managed/Unmanaged Duality
+
+Certain managed types must have a representation available in both managed & native code. You could ask whether the canonical definition of a type is in managed code or native code within the CLR, but the answer doesn't matter – the key thing is they must both be identical. This will allow the CLR's native code to access fields within a managed object in a very fast, easy to use manner. There is a more complex way of using essentially the CLR's equivalent of Reflection over MethodTables & FieldDescs to retrieve field values, but this probably doesn't perform as well as you'd like, and it isn't very usable. For commonly used types, it makes sense to declare a data structure in native code & attempt to keep the two in sync.
+
+The CLR provides a binder for this purpose. After you define your managed & native classes, you should provide some clues to the binder to help ensure that the field offsets remain the same, to quickly spot when someone accidentally adds a field to only one definition of a type.
+
+In [mscorlib.h][mscorlib.h], you can use macros ending in "_U" to describe a type, the name of fields in managed code, and the name of fields in a corresponding native data structure. Additionally, you can specify a list of methods, and reference them by name when you attempt to call them later.
+
+[mscorlib.h]: https://github.com/dotnet/coreclr/blob/master/src/vm/mscorlib.h
+
+ DEFINE_CLASS_U(SAFE_HANDLE, Interop, SafeHandle, SafeHandle)
+ DEFINE_FIELD(SAFE_HANDLE, HANDLE, handle)
+ DEFINE_FIELD_U(SAFE_HANDLE, STATE, _state, SafeHandle, m_state)
+ DEFINE_FIELD_U(SAFE_HANDLE, OWNS_HANDLE, _ownsHandle, SafeHandle, m_ownsHandle)
+ DEFINE_FIELD_U(SAFE_HANDLE, INITIALIZED, _fullyInitialized, SafeHandle, m_fullyInitialized)
+ DEFINE_METHOD(SAFE_HANDLE, GET_IS_INVALID, get_IsInvalid, IM_RetBool)
+ DEFINE_METHOD(SAFE_HANDLE, RELEASE_HANDLE, ReleaseHandle, IM_RetBool)
+ DEFINE_METHOD(SAFE_HANDLE, DISPOSE, Dispose, IM_RetVoid)
+ DEFINE_METHOD(SAFE_HANDLE, DISPOSE_BOOL, Dispose, IM_Bool_RetVoid)
+
+
+Then, you can use the REF<T> template to create a type name like SAFEHANDLEREF. All the error checking from OBJECTREF is built into the REF<T> macro, and you can freely dereference this SAFEHANDLEREF & use fields off of it in native code. You still must GC protect these references.
+
+# Calling Into Managed Code From Native
+
+Clearly there are places where the CLR must call into managed code from native. For this purpose, we have added a MethodDescCallSite class to handle a lot of plumbing for you. Conceptually, all you need to do is find the MethodDesc\* for the method you want to call, find a managed object for the "this" pointer (if you're calling an instance method), pass in an array of arguments, and deal with the return value. Internally, you'll need to potentially toggle your thread's state to allow the GC to run in preemptive mode, etc.
+
+Here's a simplified example. Note how this instance uses the binder described in the previous section to call SafeHandle's virtual ReleaseHandle method.
+
+ void SafeHandle::RunReleaseMethod(SafeHandle* psh)
+ {
+ CONTRACTL {
+ THROWS;
+ GC_TRIGGERS;
+ MODE_COOPERATIVE;
+ } CONTRACTL_END;
+
+ SAFEHANDLEREF sh(psh);
+
+ GCPROTECT_BEGIN(sh);
+
+ MethodDescCallSite releaseHandle(s_pReleaseHandleMethod, METHOD__SAFE_HANDLE__RELEASE_HANDLE, (OBJECTREF*)&sh, TypeHandle(), TRUE);
+
+ ARG_SLOT releaseArgs[] = { ObjToArgSlot(sh) };
+ if (!(BOOL)releaseHandle.Call_RetBool(releaseArgs)) {
+ MDA_TRIGGER_ASSISTANT(ReleaseHandleFailed, ReportViolation)(sh->GetTypeHandle(), sh->m_handle);
+ }
+
+ GCPROTECT_END();
+ }
+
+# Interactions with Other Subsystems
+
+## Debugger
+
+One limitation of FCalls today is that you cannot easily debug both managed code and FCalls in Visual Studio's Interop (or mixed-mode) debugging. Setting a breakpoint in an FCall and debugging with Interop debugging just doesn't work. This most likely won't be fixed.
+
+# Physical Architecture
+
+When the CLR starts up, mscorlib is loaded by a method called LoadBaseSystemClasses. Here, the base data types & other similar classes (like Exception) are loaded, and appropriate global pointers are set up to refer to mscorlib's types.
+
+For FCalls, look in [fcall.h][fcall] for infrastructure, and [ecalllist.h][ecalllist] to properly inform the runtime about your FCall method.
+
+For QCalls, look in [qcall.h][qcall] for associated infrastructure, and [ecalllist.h][ecalllist] to properly inform the runtime about your QCall method.
+
+More general infrastructure and some native type definitions can be found in [object.h][object.h]. The binder uses mscorlib.h to associate managed & native classes.
+
+[object.h]: https://github.com/dotnet/coreclr/blob/master/src/vm/object.h
diff --git a/Documentation/botr/porting-ryujit.md b/Documentation/botr/porting-ryujit.md
new file mode 100644
index 0000000000..8eb0b0dbcf
--- /dev/null
+++ b/Documentation/botr/porting-ryujit.md
@@ -0,0 +1,112 @@
+# RyuJIT: Porting to different platforms
+
+## What is a Platform?
+* Target instruction set and pointer size
+* Target calling convention
+* Runtime data structures (not really covered here)
+* GC encoding
+ * So far only JIT32_GCENCODER and everything else
+* Debug info (so far mostly the same for all targets?)
+* EH info (not really covered here)
+
+One advantage of the CLR is that the VM (mostly) hides the (non-ABI) OS differences
+
+## The Very High Level View
+* 32 vs. 64 bits
+ * This work is not yet complete in the backend, but should be sharable
+* Instruction set architecture:
+ * instrsXXX.h, emitXXX.cpp and targetXXX.cpp
+ * lowerXXX.cpp
+ * codeGenXXX.cpp and simdcodegenXXX.cpp
+ * unwindXXX.cpp
+* Calling Convention: all over the place
+
+## Front-end changes
+* Calling Convention
+ * Struct args and returns seem to be the most complex differences
+ * Importer and morph are highly aware of these
+ * E.g. fgMorphArgs(), fgFixupStructReturn(), fgMorphCall(), fgPromoteStructs() and the various struct assignment morphing methods
+ * HFAs on ARM
+* Tail calls are target-dependent, but probably should be less so
+* Intrinsics: each platform recognizes different methods as intrinsics (e.g. Sin only for x86, Round everywhere BUT amd64)
+* Target-specific morphs such as for mul, mod and div
+
+## Backend Changes
+* Lowering: fully expose control flow and register requirements
+* Code Generation: traverse blocks in layout order, generating code (InstrDescs) based on register assignments on nodes
+ * Then, generate prolog & epilog, as well as GC, EH and scope tables
+* ABI changes:
+ * Calling convention register requirements
+ * Lowering of calls and returns
+ * Code sequences for prologs & epilogs
+ * Allocation & layout of frame
+
+## Target ISA "Configuration"
+* Conditional compilation (set in jit.h, based on incoming define, e.g. #ifdef X86)
+```C++
+_TARGET_64BIT_ (a 32 bit target is just !_TARGET_64BIT_)
+_TARGET_XARCH_, _TARGET_ARMARCH_
+_TARGET_AMD64_, _TARGET_X86_, _TARGET_ARM64_, _TARGET_ARM_
+```
+* Target.h
+* InstrsXXX.h
+
+## Instruction Encoding
+* The instrDesc is the data structure used for encoding
+ * It is initialized with the opcode bits, and has fields for immediates and register numbers.
+ * instrDescs are collected into groups
+ * A label may only occur at the beginning of a group
+* The emitter is called to:
+ * Create new instructions (instrDescs), during CodeGen
+ * Emit the bits from the instrDescs after CodeGen is complete
+ * Update Gcinfo (live GC vars & safe points)
+
+## Adding Encodings
+* The instruction encodings are captured in instrsXXX.h. These are the opcode bits for each instruction
+* The structure of each instruction's encoding is target-dependent
+* An "instruction" is just the representation of the opcode
+* An instance of "instrDesc" represents the instruction to be emitted
+* For each "type" of instruction, emit methods need to be implemented. These follow a pattern but a target may have unique ones, e.g.
+```C++
+emitter::emitInsMov(instruction ins, emitAttr attr, GenTree* node)
+emitter::emitIns_R_I(instruction ins, emitAttr attr, regNumber reg, ssize_t val)
+emitter::emitInsTernary(instruction ins, emitAttr attr, GenTree* dst, GenTree* src1, GenTree* src2) (currently Arm64 only)
+```
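+
+As a usage sketch, code generation reaches these through the emitter; emitting a register-immediate move might look roughly like this (the register, size, and immediate are placeholders):
+```C++
+// Sketch: emit "mov targetReg, imm" on an xarch-like target
+getEmitter()->emitIns_R_I(INS_mov, EA_4BYTE, targetReg, (ssize_t)imm);
+```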
+
+## Lowering
+* Lowering ensures that all register requirements are exposed for the register allocator
+ * Use count, def count, "internal" reg count, and any special register requirements
+ * Does half the work of code generation, since all computation is made explicit
+ * But it is NOT necessarily a 1:1 mapping from lowered tree nodes to target instructions
+ * Its first pass does a tree walk, transforming the instructions. Some of this is target-independent. Notable exceptions:
+ * Calls and arguments
+ * Switch lowering
+ * LEA transformation
+ * Its second pass walks the nodes in execution order
+ * Sets register requirements
+    * sometimes changes the register requirements of children (which have already been traversed)
+ * Sets the block order and node locations for LSRA
+    * LinearScan::startBlockSequence() and LinearScan::moveToNextBlock()
+
+## Register Allocation
+* Register allocation is largely target-independent
+ * The second phase of Lowering does nearly all the target-dependent work
+* Register candidates are determined in the front-end
+ * Local variables or temps, or fields of local variables or temps
+ * Not address-taken, plus a few other restrictions
+ * Sorted by lvaSortByRefCount(), and marked "lvTracked"
+
+## Addressing Modes
+* The code to find and capture addressing modes is particularly poorly abstracted
+* genCreateAddrMode(), in CodeGenCommon.cpp, traverses the tree looking for an addressing mode, then captures its constituent elements (base, index, scale & offset) in "out parameters"
+ * It optionally generates code
+ * For RyuJIT, it NEVER generates code, and is only used by gtSetEvalOrder, and by lowering
+
+## Code Generation
+* For the most part, the code generation method structure is the same for all architectures
+ * Most code generation methods start with "gen"
+* Theoretically, CodeGenCommon.cpp contains code "mostly" common to all targets (this factoring is imperfect)
+  * Method prolog and epilog generation
+* genCodeForBBList
+ * walks the trees in execution order, calling genCodeForTreeNode, which needs to handle all nodes that are not "contained"
+ * generates control flow code (branches, EH) for the block
diff --git a/Documentation/botr/profilability.md b/Documentation/botr/profilability.md
new file mode 100644
index 0000000000..528c3f1e07
--- /dev/null
+++ b/Documentation/botr/profilability.md
@@ -0,0 +1,240 @@
+Implementing Profilability
+==========================
+
+This document describes technical details of adding profilability to a CLR feature. This is targeted toward devs who are modifying the profiling API so their feature can be profilable.
+
+Philosophy
+==========
+
+Contracts
+---------
+
+Before delving into the details on which contracts should be used in the profiling API, it's useful to understand the overall philosophy.
+
+A philosophy behind the default contracts movement throughout the CLR (outside of the profiling API) is to encourage the majority of the CLR to be prepared to deal with "aggressive behavior" like throwing or triggering. Below you'll see that this goes hand-in-hand with the recommendations for the callback (ICorProfilerCallback) contracts, which generally prefer the more permissive ("aggressive") of the contract choices. This gives the profiler the most flexibility in what it can do during its callback (in terms of which CLR calls it can make via ICorProfilerInfo).
+
+However, the Info functions (ICorProfilerInfo) below are just the opposite: they're preferred to be restrictive rather than permissive. Why? Because we want these to be safe for the profiler to call from as many places as possible, even from those callbacks that are more restrictive than we might like (e.g., callbacks that for some reason must be GC\_NOTRIGGER).
+
+Also, the preference for more restrictive contracts in ICorProfilerInfo doesn't contradict the overall CLR default contract philosophy, because it is expected that there will be a small minority of CLR functions that need to be restrictive. ICorProfilerInfo is the root of call paths that fall into this category. Since the profiler may be calling into the CLR at delicate times, we want these calls to be as unobtrusive as possible. These are not considered mainstream functions in the CLR, but are a small minority of special call paths that need to be careful.
+
+So the general guidance is to use default contracts throughout the CLR where possible. But when you need to blaze a path of calls originating from a profiler (i.e., from ICorProfilerInfo), that path will need to have its contracts explicitly specified, and be more restrictive than the default.
+
+Performance or ease of use?
+---------------------------
+
+Both would be nice. But if you need to make a trade-off, favor performance. The profiling API is meant to be a light-weight, thin, in-process layer between the CLR and a profiling DLL. Profiler writers are few and far between, and are mostly quite sophisticated developers. Simple validation of inputs by the CLR is expected, but we only go so far. For example, consider all the profiler IDs. They're just cast pointers to C++ EE object instances that the CLR uses directly (AppDomain\*, MethodTable\*, etc.). If a profiler provides a bogus ID, the CLR AVs! This is expected. The CLR does not hash IDs in order to validate a lookup. Profilers are assumed to know what they are doing.
+
+That said, I'll repeat: simple validation of inputs by the CLR is expected. Things like checking for NULL pointers, that classes requested for inspection have been initialized, "parallel parameters" are consistent (e.g., an array pointer parameter must be non-null if its size parameter is nonzero), etc.
+
+ICorProfilerCallback
+====================
+
+This interface comprises the callbacks made by the CLR into the profiler to notify the profiler of interesting events. Each callback is wrapped in a thin method in the EE that handles locating the profiler's implementation of ICorProfilerCallback(2), and calling its corresponding method.
+
+Profilers subscribe to events by specifying the corresponding flag in a call to ICorProfilerInfo::SetEventMask(). The profiling API stores these choices and exposes them to the CLR through specialized inline functions (CORProfiler\*) that mask against the bit corresponding to the flag. Then, sprinkled throughout the CLR, you'll see code that calls the ICorProfilerCallback wrapper to notify the profiler of events as they happen, but this call is conditional on the flag being set (determined by calling the specialized inline function):
+
+ {
+ //check if profiler set flag, pin profiler
+ BEGIN_PIN_PROFILER(CORProfilerTrackModuleLoads());
+
+ //call the wrapper around the profiler's callback implementation
+ g_profControlBlock.pProfInterface->ModuleLoadStarted((ModuleID) this);
+
+ //unpin profiler
+ END_PIN_PROFILER();
+ }
+
+To be clear, the code above is what you'll see sprinkled throughout the code base. The function it calls (in this case ModuleLoadStarted()) is our wrapper around the profiler's callback implementation (in this case ICorProfilerCallback::ModuleLoadStarted()). All of our wrappers appear in a single file (vm\EEToProfInterfaceImpl.cpp), and the guidance provided in the sections below relate to those wrappers; not to the above sample code that calls the wrappers.
+
+The macro BEGIN\_PIN\_PROFILER evaluates the expression passed as its argument. If the expression is TRUE, then the profiler is pinned into memory (meaning the profiler will not be able to detach from the process) and the code between the BEGIN\_PIN\_PROFILER and END\_PIN\_PROFILER macros is executed. If the expression is FALSE, all code between the BEGIN\_PIN\_PROFILER and END\_PIN\_PROFILER macros is skipped. For more information about the BEGIN\_PIN\_PROFILER and END\_PIN\_PROFILER macros, find their definition in the code base and read the comments there.
+
+Contracts
+---------
+
+Each and every callback wrapper must have some common gunk at the top. Here's an example:
+
+ CONTRACTL
+ {
+ // Yay!
+ NOTHROW;
+
+ // Yay!
+ GC_TRIGGERS;
+
+ // Yay!
+ MODE_PREEMPTIVE;
+
+ // Yay!
+ CAN_TAKE_LOCK;
+
+ // Yay!
+ ASSERT_NO_EE_LOCKS_HELD();
+ SO_NOT_MAINLINE;
+ }
+ CONTRACTL_END;
+ CLR_TO_PROFILER_ENTRYPOINT((LF_CORPROF,
+ LL_INFO10,
+ "**PROF: useful logging text here.\n"));
+
+Important points:
+
+- You must explicitly specify a value for the throws, triggers, mode, take\_lock, and ASSERT\_NO\_EE\_LOCKS\_HELD() (latter required on callbacks only). This allows us to keep our documentation for profiler-writers accurate.
+- Each contract must have its own comment (see below for specific details on contracts)
+
+There's a "preferred" value for each contract type. If possible, use that and comment it with "Yay!" so that others who copy / paste your code elsewhere will know what's best. If it's not possible to use the preferred value, comment why.
+
+Here are the preferred values for callbacks.
+
+| Preferred | Why | Details |
+| --------- | --- | ------- |
+| NOTHROW | Allows callback to be issued from any CLR context. Since Infos should be NOTHROW as well, this shouldn't be a hardship for the profiler. | Note that you will get throws violations if the profiler calls a THROWS Info function from here, even though the profiler encloses the call in a try/catch (because our contract system can't see the profiler's try/catch). So you'll need to insert a CONTRACT\_VIOLATION(ThrowsViolation) scoped just before the call into the profiler. |
+| GC\_TRIGGERS | Gives profiler the most flexibility in the Infos it can call. | If the callback is made at a delicate time where protecting all the object refs would be error-prone or significantly degrade performance, use GC\_NOTRIGGER (and comment of course!). |
+| MODE\_PREEMPTIVE if possible, otherwise MODE\_COOPERATIVE | MODE\_PREEMPTIVE gives profiler the most flexibility in the Infos it can call (except when coop is necessary due to ObjectIDs). Also, MODE\_PREEMPTIVE is a preferred "default" contract throughout the EE, and forcing callbacks to be in preemptive encourages use of preemptive elsewhere in the EE. | MODE\_COOPERATIVE is fair if you're passing ObjectID parameters to the profiler. Otherwise, specify MODE\_PREEMPTIVE. The caller of the callback should hopefully already be in preemptive mode anyway. If not, rethink why not and potentially change the caller to be in preemptive. Otherwise, you will need to use a GCX\_PREEMP() macro before calling the callback. |
+| CAN\_TAKE\_LOCK | Gives profiler the most flexibility in the Infos it can call | Nothing further, your honor. |
+| ASSERT\_NO\_EE\_LOCKS\_HELD() | Gives profiler even more flexibility on Infos it can call, as it ensures no Info could try to retake a lock or take an out-of-order lock (since no lock is taken to "retake" or destroy ordering) | This isn't actually a contract, though the contract block is a convenient place to put this, so you don't forget. As with the contracts, if this cannot be specified, comment why. |
+
+Note: EE\_THREAD\_NOT\_REQUIRED / EE\_THREAD\_REQUIRED need **not** be specified for callbacks. GC callbacks cannot specify "REQUIRED" anyway (no EE Thread might be present), and it is only interesting to consider these on the Info functions (profiler &#8594; CLR).
+
+Entrypoint macros
+-----------------
+
+As in the example above, after the contracts there should be an entrypoint macro. This takes care of logging, marking on the EE Thread object that we're in a callback, removing the stack guard, and doing some asserts. There are a few variants of the macro you can use:
+
+ CLR_TO_PROFILER_ENTRYPOINT
+
+This is the preferred and typically-used macro.
+
+Other macro choices may be used **but you must comment** why the above (preferred) macro cannot be used.
+
+ *_FOR_THREAD_*
+
+These macros are used for ICorProfilerCallback methods that specify a ThreadID parameter whose value may not always be the _current_ ThreadID. You must specify the ThreadID as the first parameter to these macros. The macro will then use your ThreadID rather than GetThread(), to assert that the callback is currently allowed for that ThreadID (i.e., that we have not yet issued a ThreadDestroyed() for that ThreadID).
+
+ICorProfilerInfo
+================
+
+This interface comprises the entrypoints used by the profiler to call into the CLR.
+
+Synchronous / Asynchronous
+--------------------------
+
+Each Info call is classified as either synchronous or asynchronous. Synchronous functions must be called from within a callback, whereas asynchronous functions are safe to be called at any time.
+
+### Synchronous
+
+The vast majority of Info calls are synchronous: They can only be called by a profiler while it is executing inside a Callback. In other words, an ICorProfilerCallback must be on the stack for it to be legal to call a synchronous Info function. This is tracked by a bit on the EE Thread object. When a Callback is made, we set the bit. When the callback returns, we reset the bit. When a synchronous Info function is called, we test the bit—if it's not set, disallow the call.
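+
+In pseudocode, the enforcement amounts to something like this (a sketch of the idea only; the accessor name on the Thread object is invented here, though CORPROF\_E\_UNSUPPORTED\_CALL\_SEQUENCE is the real failure code):
+
+    // Sketch: at the top of a synchronous Info function
+    Thread *pThread = GetThreadNULLOk();
+    if ((pThread != NULL) && !pThread->IsInsideProfilerCallback())  // hypothetical accessor
+        return CORPROF_E_UNSUPPORTED_CALL_SEQUENCE;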
+
+#### Threads without an EE Thread
+
+Because the above bit is tracked using the EE Thread object, only Info calls made on threads containing an EE Thread object have their "synchronous-ness" enforced. Any Info call made on a non-EE Thread thread is immediately considered legal. This is generally fine, as it's mainly the EE Thread threads that build up complex contexts that would be problematic to reenter. Also, it's ultimately the profiler's responsibility to ensure correctness. As described above, for performance reasons, the profiling API historically keeps its correctness checks down to a bare minimum, so as not to increase the weight. Typically, Info calls made by a profiler on a non-EE Thread fall into these categories:
+
+- An Info call made during a GC callback on a thread performing a server GC.
+- An Info call made on a thread of the profiler's creation, such as a sampling thread (which therefore would have no CLR code on the stack).
+
+#### Enter / leave hooks
+
+If a profiler requests enter / leave hooks and uses the fast path (i.e., direct function calls from the jitted code to the profiler with no intervening profiling API code), then any call to an Info function from within its enter / leave hooks will be considered asynchronous. Again, this is for pragmatic reasons. If profiling API code doesn't get a chance to run (for performance), then we have no opportunity to set the EE Thread bit stating that we're executing inside a callback. This means a profiler is restricted to calling only asynchronous-safe Info functions from within its enter / leave hook. This is typically acceptable, as a profiler concerned enough with perf that it requires direct function calls for enter / leave will probably not be calling any Info functions from within its enter / leave hooks anyway.
+
+The alternative is for the profiler to set a flag specifying that it wants argument or return value information, which forces an intervening profiling API C function to be called to prepare the information for the profiler's Enter / Leave hooks. When such a flag is set, the profiling API sets the EE Thread bit from inside this C function that prepares the argument / return value information from the profiler. This enables the profiler to call synchronous Info functions from within its Enter / Leave hook.
+
+### Asynchronous
+
+Asynchronous Info functions are those that are safe to be called anytime (from a callback or not). There are relatively few asynchronous Info functions. They are what a hijacking sampling profiler (e.g., Visual Studio profiler) might want to call from within one of its samples. It is critical that an Info function labeled as asynchronous be able to execute from any possible call stack. A thread could be interrupted while holding any number of locks (spin locks, thread store lock, OS heap lock, etc.), and then forced by the profiler to reenter the runtime via an asynchronous Info function. This can easily cause deadlock or data corruption. There are two ways an asynchronous Info function can ensure its own safety:
+
+- Be very, very simple. Don't take locks, don't trigger a GC, don't access data that could be inconsistent, etc. OR
+- If you need to be more complex than that, have sufficient checks at the top to ensure locks, data structures, etc., are in a safe state before proceeding.
+ - Often, this includes asking whether the current thread is currently inside a forbid suspend thread region, and bailing with an error if it is, though this is not a sufficient check in all cases.
+ - DoStackSnapshot is an example of a complex asynchronous function. It uses a combination of checks (including asking whether the current thread is currently inside a forbid suspend thread region) to determine whether to proceed or bail.
+
+Contracts
+---------
+
+Each and every Info function must have some common gunk at the top. Here's an example:
+
+ CONTRACTL
+ {
+ // Yay!
+ NOTHROW;
+
+ // Yay!
+ GC_NOTRIGGER;
+
+ // Yay!
+ MODE_ANY;
+
+ // Yay!
+ EE_THREAD_NOT_REQUIRED;
+
+ // Yay!
+ CANNOT_TAKE_LOCK;
+ SO_NOT_MAINLINE;
+ }
+ CONTRACTL_END;
+ PROFILER_TO_CLR_ENTRYPOINT_SYNC((LF_CORPROF,
+ LL_INFO1000,
+ "**PROF: EnumModuleFrozenObjects 0x%p.\n",
+ moduleID));
+
+Here are the "preferred" values for each contract type. Note these are mostly different from the preferred values for Callbacks! If that confuses you, reread the Philosophy section above.
+
+| Preferred | Why | Details |
+| --------- | --- | ------- |
+| NOTHROW | Makes it easier for profiler to call; profiler doesn't need its own try / catch. | If your callees are NOTHROW then use NOTHROW. Otherwise, it's actually better to mark yourself as THROWS than to set up your own try / catch. The profiler can probably do this more efficiently by sharing a try block among multiple Info calls. |
+| GC\_NOTRIGGER | Safer for profiler to call from more situations | Go out of your way not to trigger. If an Info function _might_ trigger (e.g., loading a type if it's not already loaded), ensure there's a way, if possible, for the profiler to specify _not_ to take the trigger path (e.g., fAllowLoad parameter that can be set to FALSE), and contract that conditionally. |
+| MODE\_ANY | Safer for profiler to call from more situations | MODE\_COOPERATIVE is fair if your parameters or returns are ObjectIDs. Otherwise, MODE\_ANY is strongly preferred. |
+| CANNOT\_TAKE\_LOCK | Safer for profiler to call from more situations | Ensure your callees don't lock. If they must, comment exactly what locks are taken. |
+| Optional: EE\_THREAD\_NOT\_REQUIRED | Allows profiler to use this Info function from GC callbacks and from profiler-spun threads (e.g., sampling thread). | These contracts are not yet enforced, so it's fine to just leave it blank. If you're pretty sure your Info function doesn't need (or call anyone who needs) a current EE Thread, you can specify EE\_THREAD\_NOT\_REQUIRED as a hint for later when the thread contracts are enforced. |
+
+Here's an example of commented contracts in a function that's not as "yay" as the one above:
+
+ CONTRACTL
+ {
+ // ModuleILHeap::CreateNew throws
+ THROWS;
+
+ // AppDomainIterator::Next calls AppDomain::Release which can destroy AppDomain, and
+ // ~AppDomain triggers, according to its contract.
+ GC_TRIGGERS;
+
+ // Need cooperative mode, otherwise objectId can become invalid
+ if (GetThreadNULLOk() != NULL) { MODE_COOPERATIVE; }
+
+ // Yay!
+ EE_THREAD_NOT_REQUIRED;
+
+ // Generics::GetExactInstantiationsFromCallInformation eventually
+ // reads metadata which causes us to take a reader lock.
+ CAN_TAKE_LOCK;
+ }
+ CONTRACTL_END;
+
+Entrypoint macros
+-----------------
+
+After the contracts, there should be an entrypoint macro. This takes care of logging and, in the case of a synchronous function, consulting the callback state flags to enforce that the function really is being called synchronously. Use one of these, depending on whether the Info function is synchronous, asynchronous, or callable only from within the Initialize callback:
+
+- PROFILER\_TO\_CLR\_ENTRYPOINT\_**SYNC** _(typical choice)_
+- PROFILER\_TO\_CLR\_ENTRYPOINT\_**ASYNC**
+- PROFILER\_TO\_CLR\_ENTRYPOINT\_CALLABLE\_ON\_INIT\_ONLY
+
+As described above, asynchronous Info methods are rare and carry a higher burden. The preferred contracts above are even "more preferred" if the method is asynchronous, and two are outright required: GC\_NOTRIGGER and MODE\_ANY. CANNOT\_TAKE\_LOCK, while even more preferred in an async function than in a sync one, is not always possible. See the _Asynchronous_ section above for what to do.
+
+Files You'll Modify
+===================
+
+It's pretty straightforward to figure out where to go to add or modify methods; code inspection is all you'll need. Here are the places to visit.
+
+corprof.idl
+-----------
+
+All profiling API interfaces and types are defined in [src\inc\corprof.idl](https://github.com/dotnet/coreclr/blob/master/src/inc/corprof.idl). Go here first to define your types and methods.
+
+EEToProfInterfaceImpl.\*
+-----------------------
+
+The wrappers around the profiler's implementation of ICorProfilerCallback are located at [src\vm\EEToProfInterfaceImpl.\*](https://github.com/dotnet/coreclr/tree/master/src/vm).
+
+ProfToEEInterfaceImpl.\*
+-----------------------
+
+The implementation of ICorProfilerInfo is located at [src\vm\ProfToEEInterfaceImpl.\*](https://github.com/dotnet/coreclr/tree/master/src/vm).
diff --git a/Documentation/botr/profiling.md b/Documentation/botr/profiling.md
new file mode 100644
index 0000000000..b83f78b5bd
--- /dev/null
+++ b/Documentation/botr/profiling.md
@@ -0,0 +1,513 @@
+Profiling
+=========
+
+Profiling, in this document, means monitoring the execution of a program which is executing on the Common Language Runtime (CLR). This document details the interfaces, provided by the Runtime, to access such information.
+
+Although it is called the Profiling API, the functionality provided by it is suitable for use by more than just traditional profiling tools. Traditional profiling tools focus on measuring the execution of the program—time spent in each function, or memory usage of the program over time. However, the profiling API is really targeted at a broader class of diagnostic tools, such as code-coverage utilities or even advanced debugging aids.
+
+The common thread among all of these uses is that they are all diagnostic in nature — the tool is written to monitor the execution of a program. The Profiling API should never be used by the program itself, and the correctness of the program's execution should not depend on (or be affected by) having a profiler active against it.
+
+Profiling a CLR program requires more support than profiling conventionally compiled machine code. This is because the CLR introduces concepts such as application domains, garbage collection, managed exception handling and JIT compilation of code (converting Intermediate Language into native machine code), for which conventional profiling mechanisms cannot identify or provide useful information. The Profiling API provides this missing information in an efficient way that causes minimal impact on the performance of the CLR and the profiled program.
+
+Note that JIT-compiling routines at runtime provides good opportunities: the API allows a profiler to change the in-memory IL code stream for a routine, and then request that it be JIT-compiled anew. In this way, the profiler can dynamically add instrumentation code to particular routines that need deeper investigation. Although this approach is possible in conventional scenarios, it's much easier to do this for the CLR.
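+
+A minimal sketch of that IL-rewriting pattern (error handling and the actual IL surgery elided; the variable names are placeholders):
+
+    // Fetch the current IL body of the method
+    LPCBYTE pOldBody; ULONG cbOldBody;
+    pInfo->GetILFunctionBody(moduleId, methodToken, &pOldBody, &cbOldBody);
+
+    // Allocate space for the replacement body within the module's range
+    IMethodMalloc *pMalloc;
+    pInfo->GetILFunctionBodyAllocator(moduleId, &pMalloc);
+    LPBYTE pNewBody = (LPBYTE)pMalloc->Alloc(cbNewBody);
+
+    // ... copy pOldBody into pNewBody, inserting instrumentation ...
+
+    // Install the rewritten body; it takes effect when the method is JIT-compiled anew
+    pInfo->SetILFunctionBody(moduleId, methodToken, pNewBody);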
+
+Goals for the Profiling API
+===========================
+
+- Expose information that existing profilers will require for a user to determine and analyze performance of a program run on the CLR. Specifically:
+
+ - Common Language Runtime startup and shutdown events
+ - Application domain creation and shutdown events
+ - Assembly loading and unloading events
+ - Module load/unload events
+ - Com VTable creation and destruction events
+ - JIT-compiles, and code pitching events
+ - Class load/unload events
+ - Thread birth/death/synchronization
+ - Function entry/exit events
+ - Exceptions
+ - Transitions between managed and unmanaged execution
+ - Transitions between different Runtime _contexts_
+ - Information about Runtime suspensions
+ - Information about the Runtime memory heap and garbage collection activity
+
+- Callable from any (non-managed) COM-compatible language
+- Efficient, in terms of CPU and memory consumption - the act of profiling should not cause such a big change upon the program being profiled that the results are misleading
+- Useful to both _sampling_ and _non-sampling_ profilers. [A _sampling_ profiler inspects the profilee at regular clock ticks - maybe 5 milliseconds apart, say. A _non-sampling_ profiler is informed of events, synchronously with the thread that causes them]
+
+Non-goals for the Profiling API
+===============================
+
+- The Profiling API does **not** support profiling unmanaged code. Existing mechanisms must instead be used to profile unmanaged code. The CLR profiling API works only for managed code. However, the profiling API provides managed/unmanaged transition events to determine the boundaries between managed and unmanaged code.
+- The Profiling API does **not** support writing applications that will modify their own code, for purposes such as aspect-oriented programming.
+- The Profiling API does **not** provide information needed to check bounds. The CLR provides intrinsic support for bounds checking of all managed code.
+
+The CLR code profiler interfaces do not support remote profiling, for the following reasons:
+
+- It is necessary to minimize execution time using these interfaces so that profiling results will not be unduly affected. This is especially true where execution performance is being monitored. However, it is not a limitation when the interfaces are used to monitor memory usage or to obtain Runtime information on stack frames, objects, etc.
+- The code profiler needs to register one or more callback interfaces with the Runtime on the local machine on which the application being profiled runs. This limits the ability to create a remote code profiler.
+
+Profiling API – Overview
+========================
+
+The profiling API within CLR allows the user to monitor the execution and memory usage of a running application. Typically, this API will be used to write a code profiler package. In the sections that follow, we will talk about a profiler as a package built to monitor execution of _any_ managed application.
+
+The profiling API is used by a profiler DLL, loaded into the same process as the program being profiled. The profiler DLL implements a callback interface (ICorProfilerCallback2). The runtime calls methods on that interface to notify the profiler of events in the profiled process. The profiler can call back into the runtime with methods on ICorProfilerInfo to get information about the state of the profiled application.
+
+Note that only the data-gathering part of the profiler solution should be running in-process with the profiled application—UI and data analysis should be done in a separate process.
+
+![Profiling Process Overview](images/profiling-overview.png)
+
+The _ICorProfilerCallback_ and _ICorProfilerCallback2_ interfaces consist of methods with names like ClassLoadStarted, ClassLoadFinished, and JITCompilationStarted. Each time the CLR loads/unloads a class, compiles a function, etc., it calls the corresponding method in the profiler's _ICorProfilerCallback/ICorProfilerCallback2_ interface. (And similarly for all of the other notifications; see later for details.)
+
+So, for example, a profiler could measure code performance via the two notifications FunctionEnter and FunctionLeave. It simply timestamps each notification, accumulates results, then outputs a list indicating which functions consumed the most CPU time, or most wall-clock time, during execution of the application.
+
+The _ICorProfilerCallback/ICorProfilerCallback2_ interface can be considered to be the "notifications API".
+
+The other interface involved in profiling is _ICorProfilerInfo_. The profiler calls this, as required, to obtain more information to help its analysis. For example, whenever the CLR calls FunctionEnter it supplies a value for the FunctionId. The profiler can discover more information about that FunctionId by calling _ICorProfilerInfo::GetFunctionInfo_ to discover the function's parent class, its name, and so on.
+
+The picture so far describes what happens once the application and profiler are running. But how are the two connected together when an application is started? The CLR makes the connection during its initialization in each process. It decides whether to connect to a profiler, and which profiler that should be, depending upon the values of two environment variables, checked one after the other:
+
+- Cor\_Enable\_Profiling - only connect with a profiler if this environment variable exists and is set to a non-zero value.
+- Cor\_Profiler - connect with the profiler with this CLSID or ProgID (which must have been stored previously in the Registry). The Cor\_Profiler environment variable is defined as a string:
+ - set Cor\_Profiler={32E2F4DA-1BEA-47ea-88F9-C5DAF691C94A}, or
+  - set Cor\_Profiler="MyProfiler"
+- The profiler class is the one that implements _ICorProfilerCallback/ICorProfilerCallback2_. It is required that a profiler implement ICorProfilerCallback2; if it does not, it will not be loaded.
+
+When both checks above pass, the CLR creates an instance of the profiler in a similar fashion to _CoCreateInstance_. The profiler is not loaded through a direct call to _CoCreateInstance_ so that a call to _CoInitialize_ may be avoided, which requires setting the threading model. It then calls the _ICorProfilerCallback::Initialize_ method in the profiler. The signature of this method is:
+
+    HRESULT Initialize(IUnknown *pICorProfilerInfoUnk)
+
+The profiler must QueryInterface pICorProfilerInfoUnk for an _ICorProfilerInfo_ interface pointer and save it so that it can call for more info during later profiling. It then calls ICorProfilerInfo::SetEventMask to say which categories of notifications it is interested in. For example:
+
+    ICorProfilerInfo* pInfo;
+
+    pICorProfilerInfoUnk->QueryInterface(IID_ICorProfilerInfo, (void**)&pInfo);
+
+    pInfo->SetEventMask(COR_PRF_MONITOR_ENTERLEAVE | COR_PRF_MONITOR_GC);
+
+This mask would be used for a profiler interested only in function enter/leave notifications and garbage collection notifications. The profiler then simply returns, and is off and running!
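+
+Putting the pieces together, a minimal Initialize implementation might look roughly like this (a sketch only, assuming a profiler class with an m\_pInfo member; production code would check every HRESULT):
+
+    HRESULT CMyProfiler::Initialize(IUnknown *pICorProfilerInfoUnk)
+    {
+        // Grab and cache the info interface for use during later callbacks
+        HRESULT hr = pICorProfilerInfoUnk->QueryInterface(IID_ICorProfilerInfo,
+                                                          (void**)&m_pInfo);
+        if (FAILED(hr))
+            return hr;
+
+        // Subscribe to just the event categories this profiler cares about
+        return m_pInfo->SetEventMask(COR_PRF_MONITOR_ENTERLEAVE | COR_PRF_MONITOR_GC);
+    }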
+
+By setting the notifications mask in this way, the profiler can limit which notifications it receives. This obviously helps the user build a simpler, or special-purpose profiler; it also reduces wasted cpu time in sending notifications that the profiler would simply 'drop on the floor' (see later for details).
+
+TODO: This text is a bit confusing. It seems to be conflating the fact that you need to create a different 'environment' (as in environment variables) to specify a different profiler and the fact that only one profiler can attach to a process at once. It may also be conflating launch vs. attach scenarios. Is that right??
+
+Note that only one profiler can be profiling a process at one time. Because the profiler is chosen via environment variables, processes launched in different environments can each be profiled by a different registered profiler, but each process still has at most one profiler attached.
+
+Certain profiler events are IMMUTABLE which means that once they are set in the _ICorProfilerCallback::Initialize_ callback they cannot be turned off using ICorProfilerInfo::SetEventMask(). Trying to change an immutable event will result in SetEventMask returning a failed HRESULT.
+
+The profiler must be implemented as an inproc COM server – a DLL, which is mapped into the same address space as the process being profiled. Any other type of COM server is not supported; if a profiler, for example, wants to monitor applications from a remote computer, it must implement 'collector agents' on each machine, which batch results and communicate them to the central data collection machine.
+
+Profiling API – Recurring Concepts
+==================================
+
+This brief section explains a few concepts that apply throughout the profiling API, rather than repeat them with the description of each method.
+
+IDs
+---
+
+Runtime notifications supply an ID for reported classes, threads, AppDomains, etc. These IDs can be used to query the Runtime for more info. These IDs are simply the address of a block in memory that describes the item; however, they should be treated as opaque handles by any profiler. If an invalid ID is used in a call to any Profiling API function, the results are undefined. Most likely, the result will be an access violation. The user has to ensure that the IDs used are valid. The profiling API does not perform any type of validation, since that would create overhead and slow down execution considerably.
+
+### Uniqueness
+
+A ProcessID is unique system-wide for the lifetime of the process. All other IDs are unique process-wide for the lifetime of the ID.
+
+### Hierarchy & Containment
+
+IDs are arranged in a hierarchy, mirroring the hierarchy in the process. Processes contain AppDomains, which contain Assemblies, which contain Modules, which contain Classes, which contain Functions. Threads are contained within Processes, and may move from AppDomain to AppDomain. Objects are mostly contained within AppDomains (a very few objects may be members of more than one AppDomain at a time). Contexts are contained within Processes.
+
+### Lifetime & Stability
+
+When a given ID dies, all IDs contained within it die.
+
+ProcessID – Alive and stable from the call to Initialize until the return from Shutdown.
+
+AppDomainID – Alive and stable from the call to AppDomainCreationFinished until the return from AppDomainShutdownStarted.
+
+AssemblyID, ModuleID, ClassID – Alive and stable from the call to LoadFinished for the ID until the return from UnloadStarted for the ID.
+
+FunctionID – Alive and stable from the call to JITCompilationFinished or JITCachedFunctionSearchFinished until the death of the containing ClassID.
+
+ThreadID – Alive and stable from the call to ThreadCreated until the return from ThreadDestroyed.
+
+ObjectID – Alive beginning with the call to ObjectAllocated. Eligible to change or die with each garbage collection.
+
+GCHandleID – Alive from the call to HandleCreated until the return from HandleDestroyed.
+
+In addition, any ID returned from a profiling API function will be alive at the time it is returned.
+
+### App-Domain Affinity
+
+There is an AppDomainID for each user-created app-domain in the process, plus the "default" domain, plus a special pseudo-domain used for holding domain-neutral assemblies.
+
+Assembly, Module, Class, Function, and GCHandleIDs have app-domain affinity, meaning that if an assembly is loaded into multiple app domains, it (and all of the modules, classes, and functions contained within it) will have a different ID in each, and operations upon each ID will take effect only in the associated app domain. Domain-neutral assemblies will appear in the special pseudo-domain mentioned above.
+
+### Special Notes
+
+All IDs except ObjectID should be treated as opaque values. Most IDs are fairly self-explanatory. A few are worth explaining in more detail:
+
+**ClassIDs** represent classes. In the case of generic classes, they represent fully-instantiated types. List\<int\>, List\<char\>, List\<object\>, and List\<string\> each have their own ClassID. List\<T\> is an uninstantiated type, and has no ClassID. Dictionary\<string,V\> is a partially-instantiated type, and has no ClassID.
+
+**FunctionIDs** represent native code for a function. In the case of generic functions (or functions on generic classes), there may be multiple native code instantiations for a given function, and thus multiple FunctionIDs. Native code instantiations may be shared between different types — for example, List\<string\> and List\<object\> share all code — so a FunctionID may "belong" to more than one ClassID.
+
+**ObjectIDs** represent garbage-collected objects. An ObjectID is the current address of the object at the time the ObjectID is received by the profiler, and may change with each garbage collection. Thus, an ObjectID value is only valid between the time it is received and when the next garbage collection begins. The CLR also supplies notifications that allow a profiler to update its internal maps that track objects, so that a profiler may maintain a valid ObjectID across garbage collections.
+
+**GCHandleIDs** represent entries in the GC's handle table. GCHandleIDs, unlike ObjectIDs, are opaque values. GC handles are created by the runtime itself in some situations, or can be created by user code using the System.Runtime.InteropServices.GCHandle structure. (Note that the GCHandle structure merely represents the handle; the handle does not "live" within the GCHandle struct.)
+
+**ThreadIDs** represent managed threads. If a host supports execution in fiber mode, a managed thread may exist on different OS threads, depending on when it is examined. (**Note:** profiling of fiber-mode applications is not supported.)
+
+Callback Return Values
+----------------------
+
+A profiler returns a status, as an HRESULT, for each notification triggered by the CLR. That status may have the value S\_OK or E\_FAIL. Currently the Runtime ignores this status value in every callback except ObjectReferences.
+
+Caller-Allocated Buffers
+------------------------
+
+ICorProfilerInfo functions that take caller-allocated buffers typically conform to the following signature:
+
+    HRESULT GetBuffer( [in] /* Some query information */,
+                       [in] ULONG32 cBuffer,
+                       [out] ULONG32 *pcBuffer,
+                       [out, size_is(cBuffer), length_is(*pcBuffer)] /* TYPE */ buffer[] );
+
+These functions will always behave as follows:
+
+- cBuffer is the number of elements allocated in the buffer.
+- \*pcBuffer will be set to the total number of elements available.
+- buffer will be filled with as many elements as possible
+
+If any elements are returned, the return value will be S\_OK. It is the caller's responsibility to check if the buffer was large enough.
+
+If buffer is NULL, cBuffer must be 0. The function will return S\_OK and set \*pcBuffer to the total number of elements available.
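+
+This leads to the usual two-call pattern (a sketch against the hypothetical GetBuffer above; TYPE is a placeholder):
+
+    // First call: NULL buffer simply asks how many elements are available
+    ULONG32 cAvailable = 0;
+    pInfo->GetBuffer(/* query info */, 0, &cAvailable, NULL);
+
+    // Second call: allocate and fetch the elements
+    TYPE *pElements = new TYPE[cAvailable];
+    pInfo->GetBuffer(/* query info */, cAvailable, &cAvailable, pElements);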
+
+Optional Out Parameters
+-----------------------
+
+All [out] parameters on the API are optional, unless a function has only one [out] parameter. A profiler simply passes NULL for any [out] parameters it is not interested in. The profiler must also pass consistent values for any associated [in] parameters—e.g., if the NULL [out] parameter is a buffer to be filled with data, the [in] parameter specifying its size must be 0.
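+
+For example, to learn only the length of an app domain's friendly name (using the real GetAppDomainInfo signature), a profiler can pass NULL for both the name buffer and the process ID:
+
+    ULONG cchName = 0;
+    // NULL buffer and NULL process ID: only the name length is wanted
+    pInfo->GetAppDomainInfo(appDomainId, 0, &cchName, NULL, NULL);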
+
+Notification Thread
+-------------------
+
+In most cases, the notifications are executed by the same thread that generated the event. Such notifications (for example, FunctionEnter and FunctionLeave) don't need to supply an explicit ThreadID. Also, the profiler might choose to use thread-local storage to store and update its analysis blocks, as compared with indexing into global storage based on the ThreadID of the affected thread.
+
+Each notification documents which thread does the call – either the thread which generated the event or some utility thread (e.g. garbage collector) within the Runtime. For any callback that might be invoked by a different thread, a user can call _ICorProfilerInfo::GetCurrentThreadID_ to discover the thread that generated the event.
+
+Note that these callbacks are not serialized. The profiler developer must write defensive code, creating thread-safe data structures and locking the profiler code where necessary to prevent parallel access from multiple threads. Therefore, in certain cases it is possible to receive an unusual sequence of callbacks. For example, assume a managed application spawns two threads that execute identical code. In this case, it is possible to receive a JITCompilationStarted event for some function from one thread and, before the respective JITCompilationFinished callback arrives, a FunctionEnter callback from the other thread. The user will thus receive a FunctionEnter callback for a function that seems not to be fully JIT-compiled yet!
+
+GC-Safe Callouts
+----------------
+
+When the CLR calls certain functions in the _ICorProfilerCallback_, the Runtime cannot perform a garbage collection until the Profiler returns control from that call. This is because profiling services cannot always construct the stack into a state that is safe for a garbage collection; instead garbage collection is disabled around that callback. For these cases, the Profiler should take care to return control as soon as possible. The callbacks where this applies are:
+
+- FunctionEnter, FunctionLeave, FunctionTailCall
+- ExceptionOSHandlerEnter, ExceptionOSHandlerLeave
+- ExceptionUnwindFunctionEnter, ExceptionUnwindFunctionLeave
+- ExceptionUnwindFinallyEnter, ExceptionUnwindFinallyLeave
+- ExceptionCatcherEnter, ExceptionCatcherLeave
+- ExceptionCLRCatcherFound, ExceptionCLRCatcherExecute
+- COMClassicVTableCreated, COMClassicVTableDestroyed
+
+In addition, the following callbacks may or may not allow the Profiler to block. This is indicated, call-by-call, via the fIsSafeToBlock argument. This set includes:
+
+- JITCompilationStarted, JITCompilationFinished
+
+Note that if the Profiler _does_ block, it will delay garbage collection. This is harmless, as long as the Profiler code itself does not attempt to allocate space in the managed heap, which could induce deadlock.
+
+Using COM
+---------
+
+Though the profiling API interfaces are defined as COM interfaces, the runtime does not actually initialize COM in order to use them. This is in order to avoid having to set the threading model via CoInitialize before the managed application has had a chance to specify its desired threading model. Similarly, the profiler itself should not call CoInitialize, since it may pick a threading model that is incompatible with the application being profiled and therefore break the app.
+
+Callbacks and Stack Depth
+-------------------------
+
+Profiler callbacks may be issued in extremely stack-constrained circumstances, and a stack overflow within a profiler callback will lead to immediate process exit. A profiler should be careful to use as little stack as possible in response to callbacks. If the profiler is intended for use against processes that are robust against stack overflow, the profiler itself should also avoid triggering stack overflow.
+
+How to profile a NT Service
+---------------------------
+
+Profiling is enabled through environment variables, and since NT Services are started when the Operating System boots, those environment variables must be present and set to the required value at that time. Thus, to profile an NT Service, the appropriate environment variables must be set in advance, system-wide, via:
+
+My Computer -> Properties -> Advanced -> Environment Variables -> System Variables
+
+Both **Cor\_Enable\_Profiling** and **Cor\_Profiler** have to be set, and the user must ensure that the Profiler DLL is registered. Then, the target machine should be rebooted so that the NT Services pick up those changes. Note that this will enable profiling on a system-wide basis. So, to prevent every managed application that is run subsequently from being profiled, the user should delete those system environment variables after the reboot.
+
+Profiling API – High-Level Description
+======================================
+
+Loader Callbacks
+----------------
+
+The loader callbacks are those issued for app domain, assembly, module, and class loading.
+
+One might expect the CLR to send an assembly-load notification followed by one or more module-load notifications for that assembly. However, what actually happens depends on any number of factors within the implementation of the loader. The profiler may depend on the following:
+
+- A Started callback will be delivered before the Finished callback for the same ID.
+- Started and Finished callbacks will be delivered on the same thread.
+
+Though the loader callbacks are arranged in Started/Finished pairs, they cannot be used to accurately attribute time to operations within the loader.
+
+Call stacks
+-----------
+
+The profiling API provides two ways of obtaining call stacks—a snapshot method, suitable for sparse gathering of callstacks, and a shadow-stack method, suitable for tracking the callstack at every instant.
+
+### Stack Snapshot
+
+A stack snapshot is a trace of the stack of a thread at an instant in time. The profiling API provides support for tracing the managed functions on the stack, but leaves the tracing of unmanaged functions to the profiler's own stack walker.
+
+### Shadow Stack
+
+Using the above snapshot method too frequently can quickly become a performance issue. When stack traces need to be taken often, profilers should instead build a "shadow stack" using the FunctionEnter, FunctionLeave, FunctionTailCall, and Exception\* callbacks. The shadow stack is always current and can be quickly copied to storage whenever a stack snapshot is needed.
+
+A shadow stack may obtain function arguments, return values, and information about generic instantiations. This information is only available through the shadow stack, because it's readily available at function-enter time but may have been optimized away later in the run of the function.
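+
+A minimal sketch of the shadow-stack idea (hypothetical profiler-side code; argument capture, the exception callbacks, and synchronization with any sampling thread are omitted):
+
+    #include <vector>
+
+    // One shadow stack per managed thread
+    thread_local std::vector<FunctionID> g_shadowStack;
+
+    void OnFunctionEnter(FunctionID functionId)
+    {
+        g_shadowStack.push_back(functionId);  // function entered: push
+    }
+
+    void OnFunctionLeave(FunctionID functionId)
+    {
+        g_shadowStack.pop_back();             // function returned: pop
+    }
+
+    // A stack snapshot is then just a copy of g_shadowStack.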
+
+Garbage Collection
+------------------
+
+When the profiler specifies the COR\_PRF\_MONITOR\_GC flag, all of the GC events will be triggered in the profiler except the _ICorProfilerCallback::ObjectAllocated_ events. Those are explicitly controlled by another flag (see next section) for performance reasons. Note that when COR\_PRF\_MONITOR\_GC is enabled, concurrent garbage collection is turned off.
+
+A profiler may use the GarbageCollectionStarted/Finished callbacks to identify that a GC is taking place, and which generations are covered.
+
+### Tracking Moved Objects
+
+Garbage collection reclaims the memory occupied by 'dead' objects and compacts that freed space. As a result, live objects are moved within the heap, so the _ObjectIDs_ handed out by previous notifications change their value (the internal state of the object itself does not change, other than its references to other objects; only its location in memory changes, and therefore its _ObjectID_). The _MovedReferences_ notification lets a profiler update its internal tables that are tracking info by _ObjectID_. Its name is somewhat misleading, as it is issued even for objects that were not moved.
+
+The number of objects in the heap can number thousands or millions. With such large numbers, it's impractical to notify their movement by providing a before-and-after ID for each object. However, the garbage collector tends to move contiguous runs of live objects as a 'bunch' – so they end up at new locations in the heap, but they are still contiguous. This notification reports the "before" and "after" _ObjectID_ of these contiguous runs of objects. (see example below)
+
+In other words, if an _ObjectID_ value lies within the range:
+
+    oldObjectIDRangeStart[i] <= ObjectID < oldObjectIDRangeStart[i] + cObjectIDRangeLength[i]
+
+for 0 <= i < cMovedObjectIDRanges, then the _ObjectID_ value has changed to:
+
+    ObjectID - oldObjectIDRangeStart[i] + newObjectIDRangeStart[i]
+
+All of these callbacks are made while the Runtime is suspended, so none of the _ObjectID_ values can change until the Runtime resumes and another GC occurs.
+
+**Example:** The diagram below shows 10 objects before garbage collection. They lie at start addresses (equivalent to _ObjectIDs_) of 08, 09, 10, 12, 13, 15, 16, 17, 18 and 19. _ObjectIDs_ 09, 13 and 19 are dead (shown shaded); their space will be reclaimed during garbage collection.
+
+![Garbage Collection](profiling-gc.png)
+
+The "After" picture shows how the space occupied by dead objects has been reclaimed to hold live objects. The live objects have been moved in the heap to the new locations shown. As a result, their _ObjectIDs_ all change. The simplistic way to describe these changes is with a table of before-and-after _ObjectIDs_, like this:
+
+| | oldObjectIDRangeStart[] | newObjectIDRangeStart[] |
+|:--:|:-----------------------:|:-----------------------:|
+| 0 | 08 | 07 |
+| 1 | 09 | |
+| 2 | 10 | 08 |
+| 3 | 12 | 10 |
+| 4 | 13 | |
+| 5 | 15 | 11 |
+| 6 | 16 | 12 |
+| 7 | 17 | 13 |
+| 8 | 18 | 14 |
+| 9 | 19 | |
+
+This works, but clearly, we can compact the information by specifying starts and sizes of contiguous runs, like this:
+
+| | oldObjectIDRangeStart[] | newObjectIDRangeStart[] | cObjectIDRangeLength[] |
+|:--:|:-----------------------:|:-----------------------:|:----------------------:|
+| 0 | 08 | 07 | 1 |
+| 1 | 10 | 08 | 3 |
+| 2 | 15 | 11 | 4 |
+
+This corresponds to exactly how _MovedReferences_ reports the information. Note that _MovedReferences_ is reporting the new layout of the objects BEFORE they actually get relocated in the heap, so the old _ObjectIDs_ are still valid for calls to the _ICorProfilerInfo_ interface (and the new _ObjectIDs_ are not).
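+
+A profiler tracking objects by _ObjectID_ can apply the reported ranges with a helper along these lines (a hypothetical sketch using the types from corprof.idl):
+
+    // Remap one tracked ObjectID using the arrays passed to MovedReferences
+    ObjectID RemapObjectID(ObjectID id, ULONG cRanges,
+                           ObjectID oldStart[], ObjectID newStart[], ULONG length[])
+    {
+        for (ULONG i = 0; i < cRanges; i++)
+        {
+            if ((id >= oldStart[i]) && (id < oldStart[i] + length[i]))
+                return id - oldStart[i] + newStart[i];
+        }
+        // Not in any reported range: after a compacting GC, the object did not survive
+        return 0;
+    }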
+
+#### Detecting All Deleted Objects
+
+MovedReferences will report all objects that survive a compacting GC, regardless of whether they move; anything not reported did not survive. However, not all GCs are compacting.
+
+The profiler may call ICorProfilerInfo2::GetGenerationBounds to get the boundaries of the GC heap segments. The rangeLength field in the resulting COR\_PRF\_GC\_GENERATION\_RANGE structs can be used to figure out the extent of live objects in a compacted generation.
+
+The GarbageCollectionStarted callback indicates which generations are being collected by the current GC. All objects that are in a generation that is not being collected will survive the GC.
+
+For a non-compacting GC (a GC in which no objects get moved at all), the SurvivingReferences callback is delivered to indicate which objects survived the GC.
+
+Note that a single GC may be compacting for one generation and non-compacting for another. Any given generation will receive either SurvivingReferences callbacks or MovedReferences callbacks for a given GC, but not both.
+
+#### Remarks
+
+The application is halted following a garbage collection until the Runtime is done passing information about the heap to the code profiler. The method _ICorProfilerInfo::GetClassFromObject_ can be used to obtain the _ClassID_ of the class of which the object is an instance. The method _ICorProfilerInfo::GetTokenFromClass_ can be used to obtain metadata information about the class.
+
+RootReferences2 allows the profiler to identify objects held via special handles. The generation bounds information supplied by GetGenerationBounds combined with the collected-generation information supplied by GarbageCollectionStarted enable the profiler to identify objects that live in generations that were not collected.
+
+Object Inspection
+-----------------
+
+The FunctionEnter2/Leave2 callbacks provide information about the arguments and return value of a function, as regions of memory. The arguments are stored left-to-right in the given memory regions. A profiler can use the metadata signature of the function to interpret the arguments, as follows:
+
+| **ELEMENT\_TYPE** | **Representation** |
+| -------------------------------------- | -------------------------- |
+| Primitives (ELEMENT\_TYPE <= R8, I, U) | Primitive values |
+| Value types (VALUETYPE) | Depends on type |
+| Reference types (CLASS, STRING, OBJECT, ARRAY, GENERICINST, SZARRAY) | ObjectID (pointer into GC heap) |
+| BYREF | Managed pointer (NOT an ObjectID, but may be pointing to stack or GC heap) |
+| PTR | Unmanaged pointer (not movable by GC) |
+| FNPTR | Pointer-sized opaque value |
+| TYPEDBYREF | Managed pointer, followed by a pointer-sized opaque value |
+
+The differences between an ObjectID and a managed pointer are:
+
+- ObjectIDs only point into the GC heap or frozen object heap. Managed pointers may point to the stack as well.
+- ObjectIDs always point to the beginning of an object. Managed pointers may point to one of its fields.
+- Managed pointers cannot be passed to functions that expect an ObjectID.
+
+### Inspecting Complex Types
+
+Inspecting reference types or non-primitive value types requires some advanced techniques.
+
+For value types and reference types other than strings or arrays, GetClassLayout provides the offset for each field. The profiler can then use the metadata to determine the type of the field and recursively evaluate it. (Note that GetClassLayout returns only the fields defined by the class itself; fields defined by the parent class are not included.)
+
+For boxed value types, GetBoxClassLayout provides the offset of the value type within the box. The layout of the value type itself does not change, so once the profiler has found the value type within the box, it can use GetClassLayout to understand its layout.
+
+For strings, GetStringClassLayout provides the offsets of interesting pieces of data in the string object.
+
+Arrays are somewhat special, in that to understand arrays a function must be called for every array object, rather than just for the type. (This is because there are too many formats of arrays to describe using offsets.) GetArrayObjectInfo is provided to do the interpretation.
+
+@TODO: Callbacks from which inspection is safe
+
+@TODO: Functions that are legal to call when threads are hard-suspended
+
+### Inspecting Static Fields
+
+GetThreadStaticAddress, GetAppDomainStaticAddress, GetContextStaticAddress, and GetRVAStaticAddress provide information about the location of static fields. Looking at the memory at that location, you interpret it as follows:
+
+- Reference types: ObjectID
+- Value types: ObjectID of box containing the actual value
+- Primitive types: Primitive value
+
+There are four types of statics. The following table describes what they are and how to identify them.
+
+| **Static Type** | **Definition** | **Identifying in Metadata** |
+| --------------- | -------------- | --------------------------- |
+| AppDomain | Your basic static field—has a different value in each app domain. | Static field with no attached custom attributes |
+| Thread | Managed TLS—a static field with a unique value for each thread and each app domain. | Static field with System.ThreadStaticAttribute |
+| RVA | Process-scoped static field with a home in the module's data section | Static field with hasRVA flag |
+| Context | Static field with a different value in each COM+ Context | Static field with System.ContextStaticAttribute |
+
+Exceptions
+----------
+
+Notifications of exceptions are the most difficult of all notifications to describe and to understand, because of the inherent complexity in exception processing. The set of exception notifications described below was designed to provide all the information required for a sophisticated profiler – so that, at every instant, it can keep track of which pass (first or second), which frame, which filter, and which finally block is being executed, for every thread in the profilee process. Note that the exception notifications do not provide any _ThreadIDs_, but a profiler can always call _ICorProfilerInfo::GetCurrentThreadID_ to discover which managed thread threw the exception.
+
+![Exception callback sequence](profiling-exception-callback-sequence.png)
+
+The figure above displays how the code profiler receives the various callbacks, when monitoring exception events. Each thread starts out in "Normal Execution." When the thread is in a state within the big gray box, the exception system has control of the thread—any non-exception-related callbacks (e.g. ObjectAllocated) that occur while the thread is in one of these states may be attributed to the exception system itself. When the thread is in a state outside of the big gray box, it is running arbitrary managed code.
+
+### Nested Exceptions
+
+Threads that have transitioned into managed code in the midst of processing an exception could throw another exception, which would result in a whole new pass of exception handling (the "New EH Pass" boxes above). If such a "nested" exception escapes the filter/finally/catch from the original exception, it can affect the original exception:
+
+- If the nested exception occurred within a filter, and escapes the filter, the filter will be considered to return "false" and the first pass will continue.
+- If the nested exception occurred within a finally, and escapes the finally, the original exception's processing will never resume.
+- If the nested exception occurred within a catch, and escapes the catch, the original exception's processing will never resume.
+
+### Unmanaged Handlers
+
+An exception might be handled in unmanaged code. In this case, the profiler will see the unwind phase, but no notification of any catch handlers. Execution will simply resume normally in the unmanaged code. An unmanaged-aware profiler will be able to detect this, but a managed-only profiler may see any number of things, including but not limited to:
+
+- An UnmanagedToManagedTransition callback as the unmanaged code calls or returns to managed code.
+- Thread termination (if the unmanaged code was at the root of the thread).
+- App termination (if the unmanaged code terminates the app).
+
+### CLR Handlers
+
+An exception might be handled by the CLR itself. In this case, the profiler will see the unwind phase, but no notification of any catch handlers. It may see execution resume normally in managed or unmanaged code.
+
+### Unhandled Exceptions
+
+By default, an unhandled exception will lead to process termination. If an application has opted back into the legacy exception policy, an unhandled exception on certain kinds of threads may only lead to thread termination.
+
+Code Generation
+---------------
+
+### Getting from IL to Native Code
+
+The IL in a .NET assembly may get compiled to native code in one of two ways: it may get JIT-compiled at run time, or it may be compiled into a "native image" by a tool called NGEN.exe (or CrossGen.exe for CoreCLR). Both the JIT-compiler and NGEN have a number of flags that control code generation.
+
+At the time an assembly is loaded, the CLR first looks for a native image for the assembly. If no native image is found with the right set of code-generation flags, the CLR will JIT-compile the functions in the assembly as they are needed during the run. Even when a native image is found and loaded, the CLR may end up JIT-compiling some of the functions in the assembly.
+
+### Profiler Control over Code-Generation
+
+The profiler has control over code generation, as described below:
+
+| **Flag** | **Effect** |
+| ------------------------------ | --- |
+| COR\_PRF\_USE\_PROFILE\_IMAGES | Causes the native image search to look for profiler-enhanced images (ngen /profile). Has no effect on JITted code. |
+| COR\_PRF\_DISABLE\_INLINING | Has no effect on the native image search. If JITting, disables inlining. All other optimizations remain in effect. |
+| COR\_PRF\_DISABLE\_OPTIMIZATIONS | Has no effect on the native image search. If JITting, disables all optimizations, including inlining. |
+| COR\_PRF\_MONITOR\_ENTERLEAVE | Causes the native image search to look for profiler-enhanced images (ngen /profile). If JITting, inserts enter/leave hooks into the generated code. |
+| COR\_PRF\_MONITOR\_CODE\_TRANSITIONS | Causes the native image search to look for profiler-enhanced images (ngen /profile). If JITting, inserts hooks at managed/unmanaged transition points. |
+
+### Profilers and Native Images
+
+When NGEN.exe creates a native image, it does much of the work that the CLR would have done at run-time—for example, class loading and method compilation. As a result, in cases where work was done at NGEN time, certain profiler callbacks will not be received at run-time:
+
+- JITCompilation\*
+- ClassLoad\*, ClassUnload\*
+
+To deal with this situation, profilers that do not wish to perturb the process by requesting profiler-enhanced native images should be prepared to lazily gather any data required about FunctionIDs or ClassIDs as they are encountered.
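+
+A sketch of such lazy gathering (illustrative; `FunctionData` and the population logic are placeholders for whatever the profiler actually needs):
+
+    #include <map>
+    #include <mutex>
+
+    struct FunctionData { /* name, ClassID, module, ... */ };
+
+    static std::mutex g_cacheLock;
+    static std::map<FunctionID, FunctionData> g_functionCache;
+
+    // Called wherever a FunctionID is first encountered (e.g. in a stack
+    // sample), since JITCompilation* callbacks may never fire for this method.
+    FunctionData& GetOrGatherFunctionData(FunctionID id, ICorProfilerInfo2* info)
+    {
+        std::lock_guard<std::mutex> hold(g_cacheLock);
+        auto it = g_functionCache.find(id);
+        if (it == g_functionCache.end())
+        {
+            FunctionData data;
+            // Populate 'data' here, e.g. via info->GetFunctionInfo2(...).
+            it = g_functionCache.emplace(id, data).first;
+        }
+        return it->second;
+    }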
+
+### Profiler-Enhanced Native Images
+
+Creating a native image with NGEN /profile turns on a set of code-generation flags that make the image easier to profile:
+
+- Enter/leave hooks are inserted into the code.
+- Managed/unmanaged transition hooks are inserted into the code.
+- JITCachedFunctionSearch notifications are given as each function in the native image is invoked for the first time.
+- ClassLoad notifications are given as each class in the native image is used for the first time.
+
+Because profiler-enhanced native images differ significantly from regular ones, profilers should only use them when the extra perturbation is acceptable.
+
+TODO: Instrumentation
+
+TODO: Remoting
+
+Security Issues in Profiling
+============================
+
+A profiler DLL is an unmanaged DLL that is effectively running as part of the CLR's execution engine itself. As a result, the code in the profiler DLL is not subject to the restrictions of managed code-access security, and the only limitations on it are those imposed by the OS on the user running the profiled application.
+
+Combining Managed and Unmanaged Code in a Code Profiler
+=======================================================
+
+A close review of the CLR Profiling API creates the impression that you could write a profiler with managed and unmanaged components that call each other through COM interop or P/Invoke (ndirect) calls.
+
+Although this is possible from a design perspective, the CLR Profiling API does not support it. A CLR profiler is supposed to be purely unmanaged. Attempts to combine managed and unmanaged code in a CLR profiler can cause crashes, hangs and deadlocks: the managed parts of the profiler will "fire" events back to its unmanaged component, which would subsequently call back into the managed part, and so on.
+
+The only location where a CLR profiler can safely invoke managed code is through replacement of the MSIL body of a method. Before the JIT-compilation of a function completes, the profiler inserts managed calls into the MSIL body of the method and then lets the JIT compile it. This technique can successfully be used for selective instrumentation of managed code, or to gather statistics and timings about the JIT.
+
+Alternatively, a code profiler could insert into the MSIL body of every managed function native "hooks" that call into unmanaged code. That technique could be used for instrumentation and coverage. For example, a code profiler could insert instrumentation hooks after every MSIL block to ensure that the block has been executed. Modifying the MSIL body of a method is a very delicate operation, and there are many factors that should be taken into consideration.
+
+Profiling Unmanaged Code
+========================
+
+There is minimal support in the Runtime profiling interfaces for profiling unmanaged code. The following functionality is provided:
+
+- Enumeration of stack chains. This allows a code profiler to determine the boundary between managed code and unmanaged code.
+- Determine if a stack chain corresponds to managed or native code.
+
+These methods are available through the in-process subset of the CLR debugging API. They are defined in CorDebug.idl and explained in DebugRef.doc; please refer to both for more details.
+
+Sampling Profilers
+==================
+
+Hijacking
+---------
+
+Some sampling profilers operate by hijacking the thread at sample time and forcing it to do the work of the sample. This is a very tricky practice that we do not recommend. The rest of this section is mostly to discourage you from going this way.
+
+### Timing of Hijacks
+
+A hijacking profiler must track the runtime suspension events (COR\_PRF\_MONITOR\_SUSPENDS). The profiler should assume that when it returns from a RuntimeThreadSuspended callback, the runtime will hijack that thread. The profiler must avoid having its hijack conflict with the runtime's hijack. To do so, the profiler must ensure that:
+
+1. The profiler does not attempt to hijack a thread between RuntimeThreadSuspended and RuntimeThreadResumed.
+1. If the profiler has begun hijacking before the RuntimeThreadSuspended callback was issued, the callback does not return before the hijack completes.
+
+This can be accomplished by some simple synchronization.
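+
+A minimal sketch of that synchronization (illustrative only; `hijackThread` is a hypothetical routine that performs the sample, and a real profiler would track suspension per thread):
+
+    #include <atomic>
+    #include <mutex>
+
+    static std::mutex g_hijackLock;
+    static std::atomic<bool> g_runtimeSuspending(false);
+
+    // Called from the profiler's ICorProfilerCallback::RuntimeThreadSuspended.
+    void OnRuntimeThreadSuspended()
+    {
+        // Acquiring the lock waits out any in-flight hijack (rule 2); the
+        // flag blocks new hijacks until the thread is resumed (rule 1).
+        std::lock_guard<std::mutex> hold(g_hijackLock);
+        g_runtimeSuspending = true;
+    }
+
+    // Called from ICorProfilerCallback::RuntimeThreadResumed.
+    void OnRuntimeThreadResumed()
+    {
+        g_runtimeSuspending = false;
+    }
+
+    // Called from the sampling thread when it wants to take a sample.
+    void TrySample(void (*hijackThread)())
+    {
+        std::lock_guard<std::mutex> hold(g_hijackLock);
+        if (!g_runtimeSuspending)
+            hijackThread();   // hypothetical: hijack and take the sample
+    }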
+
+#### Initializing the Runtime
+
+If the profiler has its own thread on which it will be calling ICorProfilerInfo functions, it needs to ensure that it calls one such function before doing any thread suspensions. This is because the runtime has per-thread state that needs to be initialized with all other threads running to avoid possible deadlocks.
diff --git a/Documentation/botr/readytorun-overview.md b/Documentation/botr/readytorun-overview.md
new file mode 100644
index 0000000000..9e9f334fea
--- /dev/null
+++ b/Documentation/botr/readytorun-overview.md
@@ -0,0 +1,335 @@
+Managed Executables with Native Code
+===
+
+# Motivation
+
+In the more than 10 years since the .NET Runtime first shipped, there has been only one file format for distributing and deploying managed code components: the CLI file format. This format expresses all execution as machine independent intermediate language (IL), which must either be interpreted or compiled to native code sometime before the code is run. This lack of an efficient, directly executable file format is a very significant difference between unmanaged and managed code, and has become more and more problematic over time. Problems include:
+
+- Native code generation takes a relatively long time and consumes power.
+- For security / tamper-resistance, there is a very strong desire to validate any native code that gets run (e.g. code is signed).
+- Existing native codegen strategies produce brittle code such that when the runtime or low level framework is updated, all native code is invalidated, which forces the need for recompilation of all that code.
+
+All of these problems and complexity are things that unmanaged code simply avoids. They are avoided because unmanaged code has a format with the following characteristics:
+
+- The executable format can be efficiently executed directly. Very little needs to be updated at runtime (binding _some_ external references) to prepare for execution. What does need to be updated can be done lazily.
+- As long as a set of known versioning rules are followed, version compatible changes in one executable do not affect any other executable (you can update your executables independently of one another).
+- The format is clearly defined, which allows a variety of compilers to produce it.
+
+In this proposal we attack this discrepancy between managed and unmanaged code head on: by giving managed code a file format that has the characteristics of unmanaged code listed above. Having such a format brings managed code up to at least parity with unmanaged code with respect to deployment characteristics. This is a huge win!
+
+
+## Problem Constraints
+
+The .NET Runtime has had a native code story (NGEN) for a long time. However what is being proposed here is architecturally different than NGEN. NGEN is fundamentally a cache (it is optional and only affects the performance of the app) and thus the fragility of the images was simply not a concern. If anything changes, the NGEN image is discarded and regenerated. On the other hand:
+
+**A native file format carries a strong guarantee that the file will continue to run despite updates and improvements to the runtime or framework.**
+
+Most of this proposal is the details of achieving this guarantee while giving up as little performance as possible.
+
+This compatibility guarantee means that, unlike NGEN, anything you place in the file is a _liability_ because you will have to support it in all future runtimes. This drives a desire to be 'minimalist' and only place things into the format that really need to be there. For everything we place into the format we have to believe either:
+
+1. It is very unlikely to change (in particular, we have not changed it over the current life of the CLR), or
+2. We have a scheme in which we can create future runtimes that could support both old and new format efficiently (both in terms of runtime efficiency and engineering complexity).
+
+Each feature of the file format needs to have an answer to the question of how it versions, and we will be trying to be as 'minimalist' as possible.
+
+
+## Solution Outline
+
+As mentioned, while NGEN is a native file format, it is not an appropriate starting point for this proposal because it is too fragile.
+
+Looking carefully at the CLI file format shows that it is really 'not that bad' as a starting point. At its heart CLI is a set of database-like tables (one for types, methods, fields, etc.), which have entries that point at variable-length things (e.g. method names, signatures, method bodies). Thus CLI is 'pay for play' and since it is already public and version resilient, there is very little downside to including it in the format. By including it we also get the following useful properties:
+
+- Immediate support for _all_ features of the runtime (at least for files that include complete CLI within them)
+- The option to only add the 'most important' data required to support fast, direct execution. Everything else can be left in CLI format and use the CLI code paths. This is quite valuable given our desire to be minimalist in augmenting the format.
+
+Moreover there is an 'obvious' way of extending the CLI file format to include the additional data we need. A CLI file has a well-defined header structure, and that header already has a field that can point to 'additional information'. This is used today in NGEN images. We would use this same technique to allow the existing CLI format to include a new 'Native Header' that would then point at any additional information needed to support fast, direct execution.
+
+The most important parts of this extra information include:
+
+1. Native code for the methods (as well as a way of referencing things outside the module)
+2. Garbage Collection (GC) information for each method that allows you to know what values in registers and on the stack are pointers to the GC heap wherever a GC is allowed.
+3. Exception handling (EH) tables that allow an exception handler to be found when an exception is thrown.
+4. A table that allows the GC and EH to be found given just the current instruction pointer (IP) within the code. (IP map).
+5. A table that links the information in the metadata to the corresponding native structure.
+
+That is, we need something to link the world of metadata to the world of native. We can't eliminate metadata completely because we want to support existing functionality. In particular we need to be able to support having other CLI images refer to types, methods and fields in this image. They will do so by referencing the information in the metadata, but once they find the target in the metadata, we will need to find the actual native code or type information corresponding to that metadata entry. This is the purpose of the additional table. Effectively, this table is the 'export' mechanism for managed references.
+
+Some of this information can be omitted or stored in more efficient form, e.g.:
+
+- The garbage collection information can be omitted for environments with conservative garbage collection, such as IL2CPP.
+- The full metadata information is not strictly required for 'private' methods or types so it is possible to strip it from the CLI image.
+- The metadata can be stored in more efficient form, such as the .NET Native metadata format.
+- The platform native executable format (ELF, Mach-O) can be used as an envelope instead of PE to take advantage of the platform OS loader.
+
+
+## Definition of Version Compatibility for Native Code
+
+Even for IL or unmanaged native code, there are limits to what compatible changes can be made. For example, deleting a public method is sure to be an incompatible change for any external code using that method.
+
+Since CIL already has a set of [compatibility rules](https://github.com/dotnet/corefx/blob/master/Documentation/coding-guidelines/breaking-changes.md), ideally the native format would have the same set of compatibility rules as CIL. Unfortunately, that is difficult to do efficiently in all cases. In those cases we have multiple choices:
+
+1. Change the compatibility rules to disallow some changes
+2. Never generate native structures for the problematic cases (fall back to CIL techniques)
+3. Generate native structures for the problematic cases, but use them only if there was no incompatible change made
+4. Generate less efficient native code that is resilient
+
+Generally the hardest versioning issues revolve around:
+
+- Value types (structs)
+- Generic methods over value types (structs)
+
+These are problematic because value classes are valuable precisely _because_ they have less overhead than classes. They achieve this value by being 'inlined' where they are used. This makes the code generated for value classes very fragile with respect to any changes to the value class's layout, which is bad for resilience. Generics over structs have a similar issue.
+
+Thus this proposal does _not_ suggest that we try to solve the problem of having version resilience in the presence of layout changes to value types. Instead we suggest creating a new compatibility rule:
+
+**It is a breaking change to change the number or type of any (including private) fields of a public value type (struct). However if the struct is non-public (that is internal), and not reachable from any nesting of value type fields in any public value type, then the restriction does not apply.**
+
+This is a compatibility restriction that is not present for CIL. All other changes allowed by CIL can be allowed by native code without prohibitive penalty. In particular the following changes are allowed:
+
+1. Adding instance and static fields to reference classes
+2. Adding static fields to a value class.
+3. Adding virtual, instance or static methods to a reference or value class
+4. Changing existing methods (assuming the semantics are compatible).
+5. Adding new classes.
+
+
+## Version Bubbles
+
+When changes to managed code are made, we have to make sure that all the artifacts in a native code image _only_ depend on information in other modules that _cannot_ _change_ without breaking the compatibility rules. What is interesting about this problem is that the constraints only come into play when you _cross_ module boundaries.
+
+As an example, consider the issue of inlining of method bodies. If module A would inline a method from Module B, that would break our desired versioning property because now if that method in module B changes, there is code in Module A that would need to be updated (which we do not wish to do). Thus inlining is illegal across modules. Inlining _within_ a module, however, is still perfectly fine.
+
+Thus in general the performance impact of versioning decreases as module size increases because there are fewer cross-module references. We can take advantage of this observation by defining something called a version bubble. **A version bubble is a set of DLLs that we are willing to update as a set.** From a versioning perspective, this set of DLLs is a single module. Inlining and other cross-module optimizations are allowed within a version bubble.
+
+It is worth reiterating the general principle covered in this section
+
+**Code of methods and types that do NOT span version bubbles does NOT pay a performance penalty.**
+
+This principle is important because it means that only a fraction (for most apps a small fraction) of all code will pay any performance penalties we discuss in the sections that follow.
+
+The extreme case is where the entire application is a single version bubble. This configuration does not need to pay any performance penalty for respecting versioning rules. It still benefits from a clearly defined file format and runtime contract that are the essential part of this proposal.
+
+## Runtime Versioning
+
+The runtime versioning is solved using different techniques because the runtime is responsible for interpretation of the binary format.
+
+To allow changes in the runtime, we simply require that the new runtime handle all old formats as well as the new format. The 'main defense' in the design of the file format is having version numbers on important structures so that the runtime has the option of supporting a new version of that structure as well as the old version unambiguously by checking the version number. Fundamentally, we are forcing the developers of the runtime to be aware of this constraint and code and test accordingly.
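+
+A hedged sketch of the idea (the structure name, fields and version numbers are illustrative, not the actual format's):
+
+    #include <cstdint>
+
+    struct NativeSectionHeader
+    {
+        uint16_t majorVersion;   // breaking layout changes bump this
+        uint16_t minorVersion;   // additive, backward compatible changes bump this
+        // ... payload follows, laid out according to (major, minor)
+    };
+
+    bool CanConsume(const NativeSectionHeader& h)
+    {
+        // A runtime keeps code paths for every major version it has shipped,
+        // and ignores additions it does not understand within a major version.
+        return h.majorVersion >= 1 && h.majorVersion <= 2;
+    }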
+
+### Restrictions on Runtime Evolution
+
+As mentioned previously, when designing for version compatibility we have the choice of either simply disallowing a change (by changing the breaking change rules), or ensuring that the format is sufficiently flexible to allow evolution. For example, for managed code we have opted to disallow changes to value type (struct) layout so that codegen for structs can be efficient. In addition, the design also includes a small number of restrictions that affect the flexibility of evolving the runtime itself. They are:
+
+- The field layout of `System.Object` cannot change. (First, there is a pointer sized field for type information and then the other fields.)
+- The field layout of arrays cannot change. (First, there is a pointer sized field for type information, and then a pointer sized field for the length. After these fields is the array data, packed using existing alignment rules.)
+- The field layout of `System.String` cannot change. (First, there is a pointer sized field for type information, and then an int32 sized field for the length. After these fields is the zero terminated string data in UTF16 encoding.)
+
+These restrictions were made because the likelihood of ever wanting to change them is low, and the performance cost of _not_ having these assumptions is high. If we did not assume the field layout of `System.Object` never changes, then _every_ field fetch from an object outside the framework itself would span a version bubble and pay a penalty. Similarly, if we did not assume the field layout for arrays or strings, then every access would pay a versioning penalty.
+
+## Selective use of the JIT
+
+One final point that is worth making is that selective use of the JIT compiler is another tool that can be used to avoid code quality penalties associated with version resilience, in environments where JITing is permitted. For example, assume that there is a hot user method that calls across a version bubble to a method that would be a good candidate for inlining, but is not inlined because of versioning constraints. For such cases, we could have an attribute that indicates that a particular method should be compiled at runtime. Since the JIT compiler is free to generate fragile code, it can perform this inlining and thus the program's steady-state performance improves. It is true that a startup time cost has been paid, but if the number of such 'hot' methods is small, the amount of JIT compilation (and thus its penalty) is not great. The point is that application developers can make this determination on a case by case basis. It is very easy for the runtime to support this capability.
+
+
+# Version Resilient Native Code Generation
+
+Because our new native format starts with the current CLI format, we have the option of falling back to it whenever we wish to. Thus we can choose to add new parts to the format in chunks. In this section we talk about the 'native code' chunk. Here we discuss the parts of the format needed to emit native code for the bodies of 'ordinary' methods. Native images that have this additional information will not need to call the JIT compiler, but will still need to call the type loader to create types.
+
+It is useful to break down the problem of generating version resilient native code by CIL instruction. Many CIL instructions (e.g. `ADD`, `MUL`, `LDLOC`, ...) naturally translate to native code in a version resilient way. However, CIL that deals with the object model (e.g. `NEWOBJ`, `LDFLD`, etc.) needs special care, as explained below. The descriptions below are roughly ordered by performance priority in typical applications. Typically, each section describes what code generation looks like when all information is within the version bubble, and then when the information crosses version bubbles. We use x64 as our native instruction set; applying the same strategy to other processor architectures is straightforward. We use the following trivial example to demonstrate the concepts:
+
+ interface Intf
+ {
+ void intfMethod();
+ }
+
+ class BaseClass
+ {
+ static int sField;
+ int iField;
+
+ public void iMethod()
+ {
+ }
+
+ public virtual void vMethod(BaseClass aC)
+ {
+ }
+ }
+
+ class SubClass : BaseClass, Intf
+ {
+ int subField;
+
+ public override void vMethod(BaseClass aC)
+ {
+ }
+
+ public virtual void intfMethod()
+ {
+ }
+ }
+
+## Instance Field access - LDFLD / STFLD
+
+The CLR stores fields in the 'standard' way, so if RCX holds a BaseClass then
+
+ MOV RAX, [RCX + iField_Offset]
+
+will fetch `iField` from this object. `iField_Offset` is a constant known at native code generation time. This is known at compile time only because we mandated that the field layout of `System.Object` is fixed, and thus the entire inheritance chain of `BaseClass` is in the version bubble. It's also true even when fields in `BaseClass` contain structs (even from outside the version bubble), because we have made it a breaking change to modify the field layout of any public value type. Thus for types whose inheritance hierarchy does not span a version bubble, field fetch is as it always was.
+
+To consider the inter-bubble case, assume that `SubClass` is defined in a different version bubble than BaseClass and we are fetching `subField`. The normal layout rules for classes require `subField` to come after all the fields of `BaseClass`. However `BaseClass` could change over time, so we can't wire in a literal constant anymore. Instead we require the following code
+
+ MOV TMP, [SIZE_OF_BASECLASS]
+ MOV EAX, [RCX + TMP + subfield_OffsetInSubClass]
+
+ .data // In the data section
+ SIZE_OF_BASECLASS: UINT32 // One per EXTERN CLASS that is subclassed
+
+This code simply assumes that a uint32 sized location has been reserved in the module and that it will be filled in with the size of `BaseClass` before this code is executed. A field fetch now has one extra instruction, which fetches this size, and that dynamic value is used to compute the field address. This sequence is a great candidate for CSE (common subexpression elimination) when multiple fields of the same class are accessed by a single method.
+
+Special attention needs to be given to the alignment requirements of `SubClass`.
+
+### GC Write Barrier
+
+The .NET GC is generational, which means that most GCs do not collect the whole heap, and instead only collect the 'new' part (which is much more likely to contain garbage). To do this it needs to know the set of roots that point into this 'new' part. This is what the GC write barrier does. Every time an object reference that lives in the GC heap is updated, bookkeeping code needs to be called to log that fact. Any fields whose values were updated are used as potential roots on these partial GCs. The important part here is that any field update of a GC reference must do this extra bookkeeping.
+
+The write barrier is implemented as a set of helper functions in the runtime. These functions have special calling conventions (they do not trash any registers). Thus these helpers act more like instructions than calls. The write barrier logic does not need to be changed to support versioning (it works fine the way it is).
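+
+For concreteness, here is a minimal sketch of a card-table style write barrier; the names, the region size, and the ephemeral-range check are illustrative, not the CLR's actual helper:
+
+    #include <cstdint>
+
+    extern uint8_t*  g_card_table;      // one byte per 2KB region of the GC heap
+    extern uintptr_t g_ephemeral_low;   // bounds of the 'new' generations
+    extern uintptr_t g_ephemeral_high;
+
+    void WriteBarrier(void** field, void* newValue)
+    {
+        *field = newValue;  // the actual reference store
+        // Bookkeeping: only stores that may create a root into the 'new'
+        // part of the heap need to be logged.
+        if ((uintptr_t)newValue >= g_ephemeral_low &&
+            (uintptr_t)newValue <  g_ephemeral_high)
+        {
+            // Mark the card covering 'field'; the next partial GC scans
+            // marked regions for roots into the young generations.
+            g_card_table[(uintptr_t)field >> 11] = 0xFF;
+        }
+    }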
+
+
+### Initializing the field size information
+
+A key observation is that you only need this overhead for each distinct class that inherits across a version bubble. Thus there is unlikely to be many slots like `SIZE_OF_BASECLASS`. Because there are likely to be few of them, the compiler can choose to simply initialize them at module load.
+
+Note that if you accessed an instance field of a class that was defined in another module, it is not the size that you need but the offset of a particular field. The code generated will be the same (in fact it will be simpler as no displacement is needed in the second instruction). Our coding guidelines strongly discourage public instance fields so this scenario is not particularly likely in practice (it will end up being a property call) but we can handle it in a natural way. Note also that even complex inheritance hierarchies that span multiple version bubbles are not a problem. In the end all you need is the final size of the base type. It might take a bit longer to compute during one time initialization, but that is the extent of the extra cost.
+
+### Performance Impact
+
+Clearly we have added an instruction and thus made the code bigger and more expensive to run. However, what is also true is that the additional cost is small. The 'worst' case would be if this field fetch were in a tight loop. To measure this we created a linked list element which inherited across a version bubble. The list was long (1K elements) but small enough to fit in the L1 cache. Even for this extreme example (which is contrived; linked list nodes do not normally inherit in such a way), the extra cost was small (< 1%).
+
+### Null checks
+
+The managed runtime requires any field access on a null instance pointer to generate a null reference exception. To avoid inserting explicit null checks, the code generator assumes that a memory access at an address smaller than a certain threshold (64k on Windows NT) will generate a null reference exception. If we allowed unlimited growth of the base class for cross-version bubble inheritance hierarchies, this optimization would no longer be possible.
+
+To make this optimization possible, we will limit growth of the base class size for cross-module inheritance hierarchies. It is a new versioning restriction that does not exist in IL today.
+
+
+## Non-Virtual Method Calls - CALL
+
+### Intra-module call
+
+If RCX holds a `BaseClass` and the caller of `iMethod` is in the same module as `BaseClass`, then a method call is a simple machine call instruction:
+
+ CALL ENTRY_IMETHOD
+
+### Inter-module call
+
+However, if the caller is outside the module of `BaseClass` (even if it is in the same version bubble), we need to call it using an indirection:
+
+ CALL [PTR_IMETHOD]
+
+ .data // In the data section
+ PTR_IMETHOD: PTR = RUNTIME_ENTRY_FIXUP_METHOD // One per call TARGET.
+
+Just like the field case, the pointer sized data slot `PTR_IMETHOD` must be fixed up to point at the entry point of `BaseClass.iMethod`. However, unlike the field case, because we are fixing up a call (and not a MOV), we can have the call fix itself up lazily via the standard delay loading mechanism.
+The delay loading mechanism often uses low-level tricks for maximum efficiency. Any low-level implementation of delay loading can be used as long as the resolution of the call target is left to the runtime.
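+
+As a sketch of the idea in C-like form (names are hypothetical, and a real thunk preserves argument registers and is generated rather than written in C++):
+
+    typedef void (*EntryPoint)();
+
+    extern EntryPoint ResolveIMethod();   // hypothetical runtime binder call
+    static void RuntimeEntryFixupMethod();
+
+    // The data-section slot: starts out pointing at the fixup helper.
+    static EntryPoint PTR_IMETHOD = RuntimeEntryFixupMethod;
+
+    static void RuntimeEntryFixupMethod()
+    {
+        // Resolve the real entry point of BaseClass.iMethod, then patch the
+        // slot so every later 'CALL [PTR_IMETHOD]' goes straight there.
+        EntryPoint target = ResolveIMethod();
+        PTR_IMETHOD = target;
+        target();   // complete the current call (argument forwarding omitted)
+    }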
+
+### Retained Flexibility for runtime innovation
+
+Note that it might seem that we have forever removed the possibility of innovating in the way we do SLOT fixup, since we 'burn' these details into the code generation and runtime helpers. However this is not true. What we have done is require that we support the _current_ mechanism for doing such fixup. Thus we must always support a `RUNTIME_ENTRY_FIXUP_METHOD` helper. However we could devise a completely different scheme. All that would be required is that you use a _new_ helper and _keep_ the old one. Thus you can have a mix of old and new native code in the same process without issue.
+
+### Calling Convention
+
+The examples above did not have arguments and the issue of calling convention was not obvious. However it is certainly true that the native code at the call site does depend heavily on the calling convention and that convention must be agreed to between the caller and the callee at least for any particular caller-callee pair.
+
+The issue of calling convention is not specific to managed code and thus hardware manufacturers typically define a calling convention that tends to be used by all languages on the system (thus allowing interoperability). In fact for all platforms except x86, CLR attempts to follow the platform calling convention.
+
+Our understanding of the most appropriate managed convention evolved over time. Our experience tells us that it is worthwhile for implementation simplicity to always pass managed `this` pointer in the fixed register, even if the platform standard calling convention says otherwise.
+
+#### Managed Code Specific Conventions
+
+In addition to the normal conventions for passing parameters, as well as the normal convention of having a hidden byref parameter for returning value types, the CLR has a few managed code specific argument conventions:
+
+1. Shared generic code has a hidden parameter that represents the type parameters in some cases for methods on generic types and for generic methods.
+2. GC interaction with the hidden return buffer: the convention for whether the hidden return buffer can be allocated in the GC heap, and thus needs to be written to using a write barrier.
+
+These conventions would be codified as well.
+
+### Performance Impact
+
+Because it was already the case that methods outside the current module had to use an indirect call, versionability does not introduce more overhead for non-virtual method calls if inlining was not done. Thus the main cost of making the native code version resilient is the requirement that no cross version bubble inlining can happen.
+
+The best solution to this problem is to avoid 'chatty' library designs (unfortunately, `IEnumerable` is such a chatty design, where each iteration does a `MoveNext` call and a `Current` property fetch). Another mitigation is the one mentioned previously: to allow clients of the library to selectively JIT compile some methods that make these chatty calls. Finally, you can also use the new custom `NonVersionableAttribute` attribute, which effectively changes the versioning contract to indicate that the library supplier has given up the right to change that method's body, and thus it would be legal to inline.
+
+The proposal is to disallow cross-version bubble inlining by default, and selectively allow inlining for critical methods (by giving up the right to change the method).
+
+Experiments with disabled cross-module inlining with the selectively enabled inlining of critical methods showed no visible regression in ASP.NET throughput.
+
+## Non-Virtual calls as the baseline solution to all other versioning issues
+
+It is important to observe that once you have a mechanism for doing non-virtual function calls in a version resilient way (by having an indirect CALL through a slot that can be fixed up lazily at runtime), all other versioning problems _can_ be solved in that way by calling back to the 'definer' module, and having the operation occur there instead. Issues associated with this technique:
+
+1. You will pay the cost of a true indirection function call and return, as well as any argument setup cost. This cost may be visible in constructs that do not contain a call naturally, like fetching string literals or other constants. You may be able to get better performance from another technique (for example, we did so with instance field access).
+2. It introduces a lot of indirect calls. It is not friendly to systems that disallow on-the-fly code generation. A small helper stub has to be created at runtime in the most straightforward implementation, or there has to be a scheme for pre-creating or recycling the stubs.
+3. It requires that the defining assembly 'know' the operations that it is responsible for defining. In general this could be fixed by JIT compiling whatever is needed at runtime (where the needed operations are known), but JIT compiling is the kind of expensive operation that we are trying to avoid at runtime.
+
+So while there are limitations to the technique, it works very well on a broad class of issues, and is conceptually simple. Moreover, it has very nice simplicity on the caller side (a single indirect call). It is hard to get simpler than this. This simplicity means that you have wired very few assumptions into the caller which maximizes the versioning flexibility, which is another very nice attribute. Finally, this technique also allows generation of optimal code once the indirect call was made. This makes for a very flexible technique that we will use again and again.
+
+## Virtual Method Calls - CALLVIRT
+
+The runtime currently supports two mechanisms for virtual dispatch. One mechanism is called virtual stub dispatch (VSD). It is used when calling interface methods. The other is a variation on traditional vtable-based dispatch, used when a non-interface virtual method is called. We first discuss the VSD approach.
+
+Assume that RCX holds a `Intf` then the call to `intfMethod()` would look like
+
+ CALL [PTR_CALLSITE]
+ .data // in the data section
+ PTR_CALLSITE: INT_PTR = RUNTIME_ENTRY_FIXUP_METHOD // One per call SITE.
+
+This looks the same as the cross-module, non-virtual case, but there are important differences. Like the non-virtual case, there is an indirect call through a pointer that lives in the module. However, unlike the non-virtual case, there is one such slot per call site (not per target). What is in this slot is always guaranteed to get to the target (in this case to `Intf.intfMethod()`), but it is expected to change over time. It starts out pointing to a 'dumb' stub which simply calls a runtime helper that does the lookup (likely in a slow way). However, that helper can update the `PTR_CALLSITE` slot to a stub that efficiently dispatches to the interface implementation for the type that actually occurred (the remaining details of stub based interface dispatch are not relevant to versioning).
+
+The above description is accurate for the current CLR implementation of interface dispatch. What's more, nothing needs to be changed about the code generation to make it version resilient. It 'just works' today. Thus interface dispatch is version resilient with no performance penalty.
+
+What's more, we can see that VSD is really just a modification of the basic 'indirect call through updatable slot' technique that was used for non-virtual method dispatch. The main difference is that because the target depends on values that are not known until runtime (the type of the 'this' pointer), the 'fixup' function can never remove itself completely but must always check this runtime value and react accordingly (which might include fixing up the slot again). To make it as likely as possible that the value in the fixup slot stabilizes, we create a fixup slot per call site (rather than per target).
+
+### Vtable Dispatch
+
+The CLR currently also supports doing virtual dispatch through function tables (vtables). Unfortunately, vtables have the same version resilience problem as fields. This problem can be fixed in a similar way; however, unlike fields, the likelihood of having many cross bubble fixups is higher for methods than for instance fields. Further, unlike fields, we already have a version resilient mechanism that works (VSD), so vtables would have to be better than that to be worth investing in. Vtable dispatch is only better than VSD for polymorphic call sites (where VSD needs to resort to a hash lookup). If we find we need to improve dispatch for this case we have some possible mitigations to try:
+
+1. If the polymorphism is limited, simply trying more cases before falling back to the hash table has been prototyped and seems to be a useful optimization.
+2. For high polymorphism case, we can explore the idea of dynamic vtable slots (where over time the virtual method a particular vtable slot holds can change). Before falling back to the hash table a virtual method could claim a vtable slot and now the dispatch of that method for _any_ type will be fast.
+
+In short, because of the flexibility and natural version resilience of VSD, we propose determining whether VSD can be 'fixed' before investing in making vtables version resilient, and using VSD for all cross version bubble interface dispatch. This does not preclude using vtables within a version bubble, nor adding support for vtable based dispatch in the future if we determine that VSD dispatch can't be fixed.
+
+
+## Object Creation - NEWOBJ / NEWARR
+
+Object allocation is always done by a helper call that allocates the uninitialized object memory (but does initialize the type information `MethodTable` pointer), followed by calling the class constructor. There are a number of different helpers depending on the characteristics of the type (does it have a finalizer, is it smaller than a certain size, ...).
+
+We will defer the choice of the helper to use to allocate the object to runtime. For example, to create an instance of `SubClass` the code would be:
+
+ CALL [NEWOBJ_SUBCLASS]
+ MOV RCX, RAX // RAX holds the new object
+ // If the constructor had parameters, set them
+ CALL SUBCLASS_CONSTRUCTOR
+
+ .data // In the data section
+ NEWOBJ_SUBCLASS: RUNTIME_ENTRY_FIXUP // One per type
+
+where the `NEWOBJ_SUBCLASS` would be fixed up using the standard lazy technique.
+
+The same technique works for creating new arrays (NEWARR instruction).
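+
+A hedged sketch of the choice the runtime makes when it fixes up such a slot (the helper names and the size cutoff are illustrative, not the actual runtime's):
+
+    #include <cstddef>
+
+    struct MethodTable;   // the runtime's per-type descriptor
+    typedef void* (*AllocHelper)(MethodTable*);
+
+    extern bool   HasFinalizer(MethodTable* mt);
+    extern size_t InstanceSize(MethodTable* mt);
+    extern void*  AllocFinalizable(MethodTable* mt); // registers for finalization
+    extern void*  AllocSmallFast(MethodTable* mt);   // inlined fast path
+    extern void*  AllocGeneric(MethodTable* mt);     // handles everything else
+
+    // Pick the allocation helper whose address gets burned into the
+    // NEWOBJ_SUBCLASS slot for this type.
+    AllocHelper SelectNewObjHelper(MethodTable* mt)
+    {
+        if (HasFinalizer(mt))        return AllocFinalizable;
+        if (InstanceSize(mt) <= 256) return AllocSmallFast;
+        return AllocGeneric;
+    }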
+
+
+## Type Casting - ISINST / CASTCLASS
+
+The proposal is to use the same technique as for object creation. Note that type casting could easily be a case where VSD techniques would be helpful (as any particular call might be monomorphic), and thus caching the result of the last type cast would be a performance win. However this optimization is not necessary for version resilience.
+
+
+## GC Information for Types
+
+To do its job the garbage collector must be able to take an arbitrary object in the GC heap and find all the GC references in that object. It is also necessary for the GC to 'scan' the GC heap from start to end, which means it needs to know the size of every object. Fast access to these two pieces of information is what is needed.
+From a versioning perspective, the fundamental problem with GC information is that (like field offsets) it incorporates, in the general case, information from the entire inheritance hierarchy. This means that the information is not version resilient.
+
+While it is possible to make the GC information resilient and have the GC use this resilient data, GC happens frequently and type loading happens infrequently, so arguably you should trade type loading speed for GC speed if given the choice. Moreover the size of the GC information is typically quite small (e.g. 12-32 bytes) and will only occur for those types that cross version bubbles. Thus forming the GC information on the fly (from a version resilient form) is a reasonable starting point.
+
+Another important observation is that `MethodTable` contains other very frequently accessed data, like flags indicating whether the `MethodTable` represents an array, or pointer to parent type. This data tends to change a lot with the evolution of the runtime. Thus, generating method tables at runtime will solve a number of other versioning issues in addition to the GC information versioning.
+
+# Current State
+
+The design and implementation is a work in progress under the code name ReadyToRun (`FEATURE_READYTORUN`). RyuJIT is currently used as the code generator to produce ReadyToRun images.
diff --git a/Documentation/botr/ryujit-overview.md b/Documentation/botr/ryujit-overview.md
new file mode 100644
index 0000000000..ee84a9a9dc
--- /dev/null
+++ b/Documentation/botr/ryujit-overview.md
@@ -0,0 +1,558 @@
+JIT Compiler Structure
+===
+
+# Introduction
+
+RyuJIT is the code name for the next generation Just in Time Compiler (aka “JIT”) for the .NET runtime. Its first implementation is for the AMD64 architecture. It is derived from a code base that is still in use for the other targets of .NET.
+
+The primary design considerations for RyuJIT are to:
+
+* Maintain a high compatibility bar with previous JITs, especially those for x86 (jit32) and x64 (jit64).
+* Support and enable good runtime performance through code optimizations, register allocation, and code generation.
+* Ensure good throughput via largely linear-order optimizations and transformations, along with limitations on tracked variables for analyses (such as dataflow) that are inherently super-linear.
+* Ensure that the JIT architecture is designed to support a range of targets and scenarios.
+
+The first objective was the primary motivation for evolving the existing code base, rather than starting from scratch or departing more drastically from the existing IR and architecture.
+
+# Execution Environment and External Interface
+
+RyuJIT provides the just in time compilation service for the .NET runtime. The runtime itself is variously called the EE (execution engine), the VM (virtual machine) or simply the CLR (common language runtime). Depending upon the configuration, the EE and JIT may reside in the same or different executable files. RyuJIT implements the JIT side of the JIT/EE interfaces:
+
+* `ICorJitCompiler` – this is the interface that the JIT compiler implements. This interface is defined in [src/inc/corjit.h](https://github.com/dotnet/coreclr/blob/master/src/inc/corjit.h) and its implementation is in [src/jit/ee_il_dll.cpp](https://github.com/dotnet/coreclr/blob/master/src/jit/ee_il_dll.cpp). The following are the key methods on this interface:
+ * `compileMethod` is the main entry point for the JIT. The EE passes it an `ICorJitInfo` object, and the “info” containing the IL, the method header, and various other useful tidbits. It returns a pointer to the code, its size, and additional GC, EH and (optionally) debug info. (A sketch of its shape follows this list.)
+ * `getVersionIdentifier` is the mechanism by which the JIT/EE interface is versioned. There is a single GUID (manually generated) which the JIT and EE must agree on.
+ * `getMaxIntrinsicSIMDVectorLength` communicates to the EE the largest SIMD vector length that the JIT can support.
+* `ICorJitInfo` – this is the interface that the EE implements. It has many methods defined on it that allow the JIT to look up metadata tokens, traverse type signatures, compute field and vtable offsets, find method entry points, construct string literals, etc. The bulk of this interface is inherited from `ICorDynamicInfo`, which is defined in [src/inc/corinfo.h](https://github.com/dotnet/coreclr/blob/master/src/inc/corinfo.h). The implementation is defined in [src/vm/jitinterface.cpp](https://github.com/dotnet/coreclr/blob/master/src/vm/jitinterface.cpp).
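+
+For orientation, the shape of that entry point, paraphrased from corjit.h (simplified; consult the header for the authoritative declaration):
+
+    // Paraphrase of the JIT's main entry point.
+    class ICorJitCompiler
+    {
+    public:
+        virtual CorJitResult __stdcall compileMethod(
+            ICorJitInfo*         comp,             // EE services for this compile
+            CORINFO_METHOD_INFO* info,             // the IL, method header, etc.
+            unsigned             flags,            // code-generation flags
+            BYTE**               nativeEntry,      // [out] generated code address
+            ULONG*               nativeSizeOfCode  // [out] generated code size
+            ) = 0;
+
+        // ... getVersionIdentifier, getMaxIntrinsicSIMDVectorLength, etc.
+    };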
+
+# Internal Representation (IR)
+
+## Overview of the IR
+
+The RyuJIT IR can be described at a high level as follows:
+
+* The Compiler object is the primary data structure of the JIT. Each method is represented as a doubly-linked list of `BasicBlock` objects. The Compiler object points to the head of this list with the `fgFirstBB` link, as well as having additional pointers to the end of the list, and other distinguished locations.
+ * `ICorJitCompiler::CompileMethod()` is invoked for each method, and creates a new Compiler object. Thus, the JIT need not worry about thread synchronization while accessing Compiler state. The EE has the necessary synchronization to ensure there is a single JIT’d copy of a method when two or more threads try to trigger JIT compilation of the same method.
+* `BasicBlock` nodes contain a doubly-linked list of statements with no internal control flow (there is an exception for the case of the qmark/colon operator).
+ * The `BasicBlock` also contains the dataflow information, when available.
+* `GenTree` nodes represent the operations and statements of the method being compiled.
+ * It includes the type of the node, as well as value number, assertions, and register assignments when available.
+* `LclVarDsc` represents a local variable, argument or JIT-created temp. It has a `gtLclNum` which is the identifier usually associated with the variable in the JIT and its dumps. The `LclVarDsc` contains the type, use count, weighted use count, frame or register assignment etc. These are often referred to simply as “lclVars”. They can be tracked (`lvTracked`), in which case they participate in dataflow analysis, and have a different index (`lvVarIndex`) to allow for the use of dense bit vectors.
+
+![RyuJIT IR Overview](../images/ryujit-ir-overview.png)
+
+The IR has two modes:
+
+* In tree-order mode, non-statement nodes (often described as expression nodes, though they are not always strictly expressions) are linked only via parent-child links (unidirectional). That is, the consuming node has pointers to the nodes that produce its input operands.
+* In linear-order mode, non-statement nodes have both parent-child links as well as execution order links (`gtPrev` and `gtNext`).
+ * In the interest of maintaining functionality that depends upon the validity of the tree ordering, the linear mode of the `GenTree` IR has an unusual constraint that the execution order must represent a valid traversal of the parent-child links.
+
+A separate representation, `insGroup` and `instrDesc`, is used during the actual instruction encoding.
+
+### Statement Order
+
+During the “front end” of the JIT compiler (prior to Rationalization), the execution order of the `GenTree` nodes on a statement is fully described by the “tree” order – that is, the links from the top node of a statement (the `gtStmtExpr`) to its children. The order is determined by a depth-first, left-to-right traversal of the tree, with the exception of nodes marked `GTF_REVERSE_OPS` on binary nodes, whose second operand is traversed before its first.
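+
+A sketch of that traversal (the structure is illustrative, not RyuJIT's actual `GenTree` definition):
+
+    struct GenTree
+    {
+        unsigned flags;   // GTF_* flags
+        GenTree* op1;     // first operand, or null for a leaf
+        GenTree* op2;     // second operand, or null for unary/leaf nodes
+    };
+    const unsigned GTF_REVERSE_OPS = 0x1;   // illustrative value
+
+    // Visit nodes in execution order: operands first (swapped when
+    // GTF_REVERSE_OPS is set), then the node itself.
+    void WalkExecutionOrder(GenTree* node, void (*visit)(GenTree*))
+    {
+        if (node == nullptr)
+            return;
+        bool reverse = (node->flags & GTF_REVERSE_OPS) != 0;
+        WalkExecutionOrder(reverse ? node->op2 : node->op1, visit);
+        WalkExecutionOrder(reverse ? node->op1 : node->op2, visit);
+        visit(node);
+    }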
+
+After rationalization, the execution order can no longer be deduced from the tree order alone. At this point, the dominant ordering becomes “linear order”. This is because at this point any `GT_COMMA` nodes have been replaced by embedded statements, whose position in the execution order can only be determined by the `gtNext` and `gtPrev` links on the tree nodes.
+
+This modality is captured in the `fgOrder` flag on the Compiler object – it is either `FGOrderTree` or `FGOrderLinear`.
+
+## GenTree Nodes
+
+Each operation is represented as a GenTree node, with an opcode (GT_xxx), zero or more child `GenTree` nodes, and additional fields as needed to represent the semantics of that node.
+
+The `GenTree` nodes are doubly-linked in execution order, but the links are not necessarily valid during all phases of the JIT.
+
+The statement nodes utilize the same `GenTree` base type as the operation nodes, though they are not truly related.
+
+* The statement nodes are doubly-linked. The first statement node in a block points to the last node in the block via its `gtPrev` link. Note that the last statement node does *not* point to the first; that is, the list is not fully circular.
+* Each statement node contains two `GenTree` links – `gtStmtExpr` points to the top-level node in the statement (i.e. the root of the tree that represents the statement), while `gtStmtList` points to the first node in execution order (again, this link is not always valid).
+
+### Example of Post-Import IR
+
+For this snippet of code (extracted from [tests/src/JIT/CodeGenBringUpTests/DblRoots.cs](https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/CodeGenBringUpTests/DblRoots.cs)):
+
+ r1 = (-b + Math.Sqrt(b*b - 4*a*c))/(2*a);
+
+A stripped-down dump of the `GenTree` nodes just after they are imported looks like this:
+
+ ▌ stmtExpr void (top level) (IL 0x000...0x026)
+ │ ┌──▌ lclVar double V00 arg0
+ │ ┌──▌ * double
+ │ │ └──▌ dconst double 2.00
+ │ ┌──▌ / double
+ │ │ │ ┌──▌ mathFN double sqrt
+ │ │ │ │ │ ┌──▌ lclVar double V02 arg2
+ │ │ │ │ │ ┌──▌ * double
+ │ │ │ │ │ │ │ ┌──▌ lclVar double V00 arg0
+ │ │ │ │ │ │ └──▌ * double
+ │ │ │ │ │ │ └──▌ dconst double 4.00
+ │ │ │ │ └──▌ - double
+ │ │ │ │ │ ┌──▌ lclVar double V01 arg1
+ │ │ │ │ └──▌ * double
+ │ │ │ │ └──▌ lclVar double V01 arg1
+ │ │ └──▌ + double
+ │ │ └──▌ unary - double
+ │ │ └──▌ lclVar double V01 arg1
+ └──▌ = double
+ └──▌ indir double
+ └──▌ lclVar byref V03 arg3
+
+## Types
+
+The JIT is primarily concerned with “primitive” types, i.e. integers, reference types, pointers, and floating point types. It must also be concerned with the format of user-defined value types (i.e. struct types derived from `System.ValueType`) – specifically, their size and the offset of any GC references they contain, so that they can be correctly initialized and copied. The primitive types are represented in the JIT by the `var_types` enum, and any additional information required for struct types is obtained from the JIT/EE interface by the use of an opaque `CORINFO_CLASS_HANDLE`.
+
+## Dataflow Information
+
+In order to limit throughput impact, the JIT limits the number of lclVars for which liveness information is computed. These are the tracked lclVars (`lvTracked` is true), and they are the only candidates for register allocation.
+
+The liveness analysis determines the set of defs, as well as the uses that are upward exposed, for each block. It then propagates the liveness information (a sketch of this propagation follows the list below). The result of the analysis is captured in the following:
+
+* The live-in and live-out sets are captured in the `bbLiveIn` and `bbLiveOut` fields of the `BasicBlock`.
+* The `GTF_VAR_DEF` flag is set on a lclVar `GenTree` node that is a definition.
+* The `GTF_VAR_USEASG` flag is set (in addition to the `GTF_VAR_DEF` flag) for the target of an update (e.g. +=).
+* The `GTF_VAR_USEDEF` is set on the target of an assignment of a binary operator with the same lclVar as an operand.
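+
+A minimal sketch of that propagation, using one bit per tracked lclVar (illustrative; RyuJIT's real implementation uses its own dense bit-vector types):
+
+    #include <cstdint>
+    #include <vector>
+
+    struct Block
+    {
+        uint64_t use;              // upward-exposed uses in the block
+        uint64_t def;              // definitions in the block
+        uint64_t liveIn, liveOut;  // computed below
+        std::vector<int> succs;    // indexes of successor blocks
+    };
+
+    void ComputeLiveness(std::vector<Block>& blocks)
+    {
+        bool changed = true;
+        while (changed)            // iterate backward to a fixed point
+        {
+            changed = false;
+            for (int i = (int)blocks.size() - 1; i >= 0; --i)
+            {
+                Block& b = blocks[i];
+                uint64_t out = 0;
+                for (int s : b.succs)
+                    out |= blocks[s].liveIn;
+                uint64_t in = b.use | (out & ~b.def);
+                if (out != b.liveOut || in != b.liveIn)
+                {
+                    b.liveOut = out;
+                    b.liveIn  = in;
+                    changed   = true;
+                }
+            }
+        }
+    }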
+
+## SSA
+
+Static single assignment (SSA) form is constructed in a traditional manner [[1]](#[1]). The SSA names are recorded on the lclVar references. While SSA form usually retains a pointer or link to the defining reference, RyuJIT currently retains only the `BasicBlock` in which the definition of each SSA name resides.
+
+## Value Numbering
+
+Value numbering utilizes SSA for lclVar values, but also performs value numbering of expression trees. It takes advantage of type safety by not invalidating the value number for field references with a heap write, unless the write is to the same field. The IR nodes are annotated with the value numbers, which are indexes into a type-specific value number store. Value numbering traverses the trees, performing symbolic evaluation of many operations.
+
+# Phases of RyuJIT
+
+The top-level function of interest is `Compiler::compCompile`. It invokes the following phases in order.
+
+| **Phase** | **IR Transformations** |
+| --- | --- |
+|[Pre-import](#pre-import)|`Compiler->lvaTable` created and filled in for each user argument and variable. BasicBlock list initialized.|
+|[Importation](#importation)|`GenTree` nodes created and linked in to Statements, and Statements into BasicBlocks. Inlining candidates identified.|
+|[Inlining](#inlining)|The IR for inlined methods is incorporated into the flowgraph.|
+|[Struct Promotion](#struct-promotion)|New lclVars are created for each field of a promoted struct.|
+|[Mark Address-Exposed Locals](#mark-addr-exposed)|lclVars with references occurring in an address-taken context are marked. This must be kept up-to-date.|
+|[Morph Blocks](#morph-blocks)|Performs localized transformations, including mandatory normalization as well as simple optimizations.|
+|[Eliminate Qmarks](#eliminate-qmarks)|All `GT_QMARK` nodes are eliminated, other than simple ones that do not require control flow.|
+|[Flowgraph Analysis](#flowgraph-analysis)|`BasicBlock` predecessors are computed, and must be kept valid. Loops are identified, and normalized, cloned and/or unrolled.|
+|[Normalize IR for Optimization](#normalize-ir)|lvlVar references counts are set, and must be kept valid. Evaluation order of `GenTree` nodes (`gtNext`/`gtPrev`) is determined, and must be kept valid.|
+|[SSA and Value Numbering Optimizations](#ssa-vn)|Computes liveness (`bbLiveIn` and `bbLiveOut` on `BasicBlocks`), and dominators. Builds SSA for tracked lvlVars. Computes value numbers.|
+|[Loop Invariant Code Hoisting](#licm)|Hoists expressions out of loops.|
+|[Copy Propagation](#copy-propagation)|Copy propagation based on value numbers.|
+|[Common Subexpression Elimination (CSE)](#cse)|Elimination of redundant subexressions based on value numbers.|
+|[Assertion Propagation](#assertion-propagation)|Utilizes value numbers to propagate and transform based on properties such as non-nullness.|
+|[Range analysis](#range-analysis)|Eliminate array index range checks based on value numbers and assertions|
+|[Rationalization](#rationalization)|Flowgraph order changes from `FGOrderTree` to `FGOrderLinear`. All `GT_COMMA`, `GT_ASG` and `GT_ADDR` nodes are transformed.|
+|[Lowering](#lowering)|Register requirements are fully specified (`gtLsraInfo`). All control flow is explicit.|
+|[Register allocation](#reg-alloc)|Registers are assigned (`gtRegNum` and/or `gtRsvdRegs`),and the number of spill temps calculated.|
+|[Code Generation](#code-generation)|Determines frame layout. Generates code for each `BasicBlock`. Generates prolog & epilog code for the method. Emit EH, GC and Debug info.|
+
+## <a name="pre-import"/>Pre-import
+
+Prior to reading in the IL for the method, the JIT initializes the local variable table, and scans the IL to find branch targets and form BasicBlocks.
+
+## <a name="importation"/>Importation
+
+Importation is the phase that creates the IR for the method, reading in one IL instruction at a time, and building up the statements. During this process, it may need to generate IR with multiple, nested expressions. This is the purpose of the non-expression-like IR nodes:
+
+* It may need to evaluate part of the expression into a temp, in which case it will use a comma (`GT_COMMA`) node to ensure that the temp is evaluated in the proper execution order – i.e. `GT_COMMA(GT_ASG(temp, exp), temp)` is inserted into the tree where “exp” would go.
+* It may need to create conditional expressions, but adding control flow at this point would be quite messy. In this case it generates question mark/colon (?: or `GT_QMARK`/`GT_COLON`) trees that may be nested within an expression.
+
+During importation, tail call candidates (either explicitly marked or opportunistically identified) are identified and flagged. They are further validated, and possibly unmarked, during morphing.
+
+## Morphing
+
+The `fgMorph` phase includes a number of transformations:
+
+### <a name="inlining"/>Inlining
+
+The `fgInline` phase determines whether each call site is a candidate for inlining. The initial determination is made via a state machine that runs over the candidate method’s IL. It estimates the native code size corresponding to the inline method, and uses a set of heuristics (including the estimated size of the current method) to determine if inlining would be profitable. If so, a separate Compiler object is created, and the importation phase is called to create the tree for the candidate inline method. Inlining may be aborted prior to completion, if any conditions are encountered that indicate that it may be unprofitable (or otherwise incorrect). If inlining is successful, the inlinee compiler’s trees are incorporated into the inliner compiler (the “parent”), with args and returns appropriately transformed.
+
+### <a name="struct-promotion"/>Struct Promotion
+
+Struct promotion (`fgPromoteStructs()`) analyzes the local variables and temps, and determines if their fields are candidates for tracking (and possibly enregistering) separately. It first determines whether it is possible to promote, which takes into account whether the layout may have holes or overlapping fields, whether its fields (flattening any contained structs) will fit in registers, etc.
+
+Next, it determines whether it is likely to be profitable, based on the number of fields, and whether the fields are individually referenced.
+
+When a lclVar is promoted, there are now N+1 lclVars for the struct, where N is the number of fields. The original struct lclVar is not considered to be tracked, but its fields may be.
+
+### <a name="mark-addr-exposed"/>Mark Address-Exposed Locals
+
+This phase traverses the expression trees, propagating the context (e.g. taking the address, indirecting) to determine which lclVars have their address taken, and which therefore will not be register candidates. If a struct lclVar has been promoted, and is then found to be address-taken, it will be considered “dependently promoted”, which is an odd way of saying that the fields will still be separately tracked, but they will not be register candidates.
+
+### <a name="morph-blocks"/>Morph Blocks
+
+What is often thought of as “morph” involves localized transformations to the trees. In addition to performing simple optimizing transformations, it performs some normalization that is required, such as converting field and array accesses into pointer arithmetic. It can (and must) be called by subsequent phases on newly added or modified trees. During the main Morph phase, the boolean `fgGlobalMorph` is set on the Compiler object, which governs which transformations are permissible.
+
+### <a name="eliminate-qmarks"/>Eliminate Qmarks
+
+This expands most `GT_QMARK`/`GT_COLON` trees into blocks, except for the case that is instantiating a condition.
+
+## <a name="flowgraph-analysis"/>Flowgraph Analysis
+
+At this point, a number of analyses and transformations are done on the flowgraph:
+
+* Computing the predecessors of each block
+* Computing edge weights, if profile information is available
+* Computing reachability and dominators
+* Identifying and normalizing loops (transforming while loops to “do while”)
+* Cloning and unrolling of loops
+
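+For reference, below is a minimal sketch of the kind of iterative dominator computation this phase performs. This is the textbook bit-vector formulation (limited here to 64 blocks for simplicity), not the actual RyuJIT implementation:
+
+```cpp
+#include <cstdint>
+#include <vector>
+
+// Compute dominator sets for a flowgraph of at most 64 blocks, where
+// block 0 is the entry. dom[i] has bit j set if block j dominates block i.
+std::vector<uint64_t> ComputeDominators(const std::vector<std::vector<int>>& preds)
+{
+    size_t n = preds.size();
+    std::vector<uint64_t> dom(n, ~0ULL);   // start from "dominated by everything"
+    dom[0] = 1;                            // the entry is dominated only by itself
+    bool changed = true;
+    while (changed)
+    {
+        changed = false;
+        for (size_t i = 1; i < n; i++)
+        {
+            uint64_t d = ~0ULL;
+            for (int p : preds[i])         // intersect over all predecessors
+                d &= dom[p];
+            d |= (1ULL << i);              // every block dominates itself
+            if (d != dom[i]) { dom[i] = d; changed = true; }
+        }
+    }
+    return dom;
+}
+```
+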
+## <a name="normalize-ir"/>Normalize IR for Optimization
+
+At this point, a number of properties are computed on the IR, and must remain valid for the remaining phases. We will call this “normalization”:
+
+* `lvaMarkLocalVars` – set the reference counts (raw and weighted) for lclVars, sort them, and determine which will be tracked (currently up to 128). Note that after this point any transformation that adds or removes lclVar references must update the reference counts.
+* `optOptimizeBools` – this optimizes Boolean expressions, and may change the flowgraph (why is it not done prior to reachability and dominators?)
+* Link the trees in evaluation order (setting the `gtNext` and `gtPrev` fields): `fgFindOperOrder()` and `fgSetBlockOrder()`.
+
+## <a name="ssa-vn"/>SSA and Value Numbering Optimizations
+
+The next set of optimizations are built on top of SSA and value numbering. First, the SSA representation is built (during which dataflow analysis, aka liveness, is computed on the lclVars), then value numbering is done using SSA.
+
+### <a name="licm"/>Loop Invariant Code Hoisting
+
+This phase traverses all the loop nests, in outer-to-inner order (thus hoisting expressions outside the largest loop in which they are invariant). It traverses all of the statements in the blocks of the loop that are always executed. A statement is hoisted if it:
+
+* is a valid CSE candidate,
+* has no side-effects,
+* does not raise an exception, or occurs in the loop prior to any side-effects, and
+* has a valid value number, and is either a lclVar defined outside the loop, or has children (the value numbers from which it was computed) that are invariant.
+
+### <a name="copy-propagation"/>Copy Propagation
+
+This phase walks each block in the graph (in dominator-first order, maintaining context between dominator and child) keeping track of every live definition. When it encounters a variable that shares the value number with a live definition, that variable is replaced with the variable of the live definition.
+
+The JIT currently requires that the IR be maintained in conventional SSA form, as there is no “out of SSA” translation (see the comments on `optVnCopyProp()` for more information).
+
+### <a name="cse"/>Common Subexpression Elimination (CSE)
+
+Utilizes value numbers to identify redundant computations, which are then evaluated into a new temp lclVar, and then reused.
+
+### <a name="assertion-propagation"/>Assertion Propagation
+
+Utilizes value numbers to propagate and transform based on properties such as non-nullness.
+
+### <a name="range-analysis"/>Range analysis
+
+Optimize array index range checks based on value numbers and assertions.
+
+## <a name="rationalization"/>Rationalization
+
+As the JIT has evolved, changes have been made to improve the ability to reason over the tree in both “tree order” and “linear order”. These changes have been termed the “rationalization” of the IR. In the spirit of reuse and evolution, some of the changes have been made only in the later (“backend”) components of the JIT. The corresponding transformations are made to the IR by a “Rationalizer” component. It is expected that over time some of these changes will migrate to an earlier place in the JIT phase order:
+
+* Elimination of assignment nodes (`GT_ASG`). The assignment node was problematic because the semantics of its destination (left hand side of the assignment) could not be determined without context. For example, a `GT_LCL_VAR` on the left-hand side of an assignment is a definition of the local variable, but on the right-hand side it is a use. Furthermore, since the execution order requires that the children be executed before the parent, it is unnatural that the left-hand side of the assignment appears in execution order before the assignment operator.
+  * During rationalization, all assignments are replaced by stores, which either represent their destination on the store node itself (e.g. `GT_STORE_LCL_VAR`), or by the use of a child address node (e.g. `GT_STORE_IND`).
+* Elimination of address nodes (`GT_ADDR`). These are problematic because of the need for parent context to analyze the child.
+* Elimination of “comma” nodes (`GT_COMMA`). These nodes are introduced for convenience during importation, during which a single tree is constructed at a time, and not incorporated into the statement list until it is completed. When it is necessary, for example, to store a partially-constructed tree into a temporary variable, a `GT_COMMA` node is used to link it into the tree. However, in later phases, these comma nodes are an impediment to analysis, and thus are split into separate statements.
+ * In some cases, it is not possible to fully extract the tree into a separate statement, due to execution order dependencies. In these cases, an “embedded” statement is created. While these are conceptually very similar to the `GT_COMMA` nodes, they do not masquerade as expressions.
+* Elimination of “QMark” (`GT_QMARK`/`GT_COLON`) nodes is actually done at the end of morphing, long before the current rationalization phase. The presence of these nodes made analyses (especially dataflow) overly complex.
+
+For our earlier example (Example of Post-Import IR), here is what the simplified dump looks like just prior to Rationalization (the $ annotations are value numbers). Note that some common subexpressions have been computed into new temporary lclVars, and that computation has been inserted as a `GT_COMMA` (comma) node in the IR:
+
+ ▌ stmtExpr void (top level) (IL 0x000...0x026)
+ │ ┌──▌ lclVar double V07 cse1 $185
+ │ ┌──▌ comma double $185
+ │ │ │ ┌──▌ dconst double 2.00 $143
+ │ │ │ ┌──▌ * double $185
+ │ │ │ │ └──▌ lclVar double V00 arg0 u:2 $80
+ │ │ └──▌ = double $VN.Void
+ │ │ └──▌ lclVar double V07 cse1 $185
+ │ ┌──▌ / double $186
+ │ │ │ ┌──▌ unary - double $84
+ │ │ │ │ └──▌ lclVar double V01 arg1 u:2 $81
+ │ │ └──▌ + double $184
+ │ │ │ ┌──▌ lclVar double V06 cse0 $83
+ │ │ └──▌ comma double $83
+ │ │ │ ┌──▌ mathFN double sqrt $83
+ │ │ │ │ │ ┌──▌ lclVar double V02 arg2 u:2 $82
+ │ │ │ │ │ ┌──▌ * double $182
+ │ │ │ │ │ │ │ ┌──▌ dconst double 4.00 $141
+ │ │ │ │ │ │ └──▌ * double $181
+ │ │ │ │ │ │ └──▌ lclVar double V00 arg0 u:2 $80
+ │ │ │ │ └──▌ - double $183
+ │ │ │ │ │ ┌──▌ lclVar double V01 arg1 u:2 $81
+ │ │ │ │ └──▌ * double $180
+ │ │ │ │ └──▌ lclVar double V01 arg1 u:2 $81
+ │ │ └──▌ = double $VN.Void
+ │ │ └──▌ lclVar double V06 cse0 $83
+ └──▌ = double $VN.Void
+ └──▌ indir double $186
+ └──▌ lclVar byref V03 arg3 u:2 (last use) $c0
+
+After rationalization, the nodes are presented in execution order, and the `GT_COMMA` (comma) and `GT_ASG` (=) nodes have been eliminated:
+
+ ▌ stmtExpr void (top level) (IL 0x000... ???)
+ │ ┌──▌ lclVar double V01 arg1
+ │ ├──▌ lclVar double V01 arg1
+ │ ┌──▌ * double
+ │ │ ┌──▌ lclVar double V00 arg0
+ │ │ ├──▌ dconst double 4.00
+ │ │ ┌──▌ * double
+ │ │ ├──▌ lclVar double V02 arg2
+ │ ├──▌ * double
+ │ ┌──▌ - double
+ │ ┌──▌ mathFN double sqrt
+ └──▌ st.lclVar double V06
+
+ ▌ stmtExpr void (top level) (IL 0x000...0x026)
+ │ ┌──▌ lclVar double V06
+ │ │ ┌──▌ lclVar double V01 arg1
+ │ ├──▌ unary - double
+ │ ┌──▌ + double
+ │ │ { ▌ stmtExpr void (embedded) (IL 0x000... ???)
+ │ │ { │ ┌──▌ lclVar double V00 arg0
+ │ │ { │ ├──▌ dconst double 2.00
+ │ │ { │ ┌──▌ * double
+ │ │ { └──▌ st.lclVar double V07
+ │ ├──▌ lclVar double V07
+ │ ┌──▌ / double
+ │ ├──▌ lclVar byref V03 arg3
+ └──▌ storeIndir double
+
+
+Note that the first operand of the first comma has been extracted into a separate statement, but the second comma causes an embedded statement to be created, in order to preserve execution order.
+
+## <a name="lowering"/>Lowering
+
+Lowering is responsible for transforming the IR in such a way that the control flow, and any register requirements, are fully exposed.
+
+It accomplishes this in two passes.
+
+The first pass is a post-order traversal that performs context-dependent transformations such as expanding switch statements (using a switch table or a series of conditional branches), constructing addressing modes, etc. For example, this:
+
+ ┌──▌ lclVar ref V00 arg0
+ │ ┌──▌ lclVar int V03 loc1
+ │ ┌──▌ cast long <- int
+ │ ├──▌ const long 2
+ ├──▌ << long
+ ┌──▌ + byref
+ ├──▌ const long 16
+ ┌──▌ + byref
+ ┌──▌ indir int
+
+Is transformed into this, in which the addressing mode is explicit:
+
+ ┌──▌ lclVar ref V00 arg0
+ │ ┌──▌ lclVar int V03 loc1
+ ├──▌ cast long <- int
+ ┌──▌ lea(b+(i*4)+16) byref
+ ┌──▌ indir int
+
+The next pass annotates the nodes with register requirements, and this is done in an execution order traversal (effectively post-order) in order to ensure that the children are visited prior to the parent. It may also do some transformations that do not require the parent context, such as determining the code generation strategy for block assignments (e.g. `GT_COPYBLK`) which may become helper calls, unrolled loops, or an instruction like `rep stos`.
+
+The register requirements are expressed in the `TreeNodeInfo` (`gtLsraInfo`) for each node. For example, for the `copyBlk` node in this snippet:
+
+ Source │ ┌──▌ const(h) long 0xCA4000 static
+ Destination │ ├──▌ &lclVar byref V04 loc4
+ │ ├──▌ const int 34
+ └──▌ copyBlk void
+
+The `TreeNodeInfo` would be as follows:
+
+ +<TreeNodeInfo @ 15 0=1 1i 1f
+ src=[allInt]
+ int=[rax rcx rdx rbx rbp rsi rdi r8-r15 mm0-mm5]
+ dst=[allInt] I>
+
+The “@ 15” is the location number of the node. The “0=1” indicates that there are zero destination registers (because this defines only memory), and 1 source register (the address of lclVar V04). The “1i” indicates that it requires 1 internal integer register (for copying the remainder after copying 16-byte sized chunks), the “1f” indicates that it requires 1 internal floating point register (for copying the two 16-byte chunks). The src, int and dst fields are encoded masks that indicate the register constraints for the source, internal and destination registers, respectively.
+
+## <a name="reg-alloc"/>Register allocation
+
+The RyuJIT register allocator uses a Linear Scan algorithm, with an approach similar to [[2]](#[2]). In brief, it operates on two main data structures:
+
+* `Intervals` (representing live ranges of variables or tree expressions) and `RegRecords` (representing physical registers), both of which derive from `Referent`.
+* `RefPositions`, which represent uses or defs (or variants thereof, such as ExposedUses) of either `Intervals` or physical registers.
+
+Pre-conditions:
+
+* The `NodeInfo` is initialized for each tree node to indicate:
+ * Number of registers consumed and produced by the node.
+ * Number and type (int versus float) of internal registers required.
+
+Allocation proceeds in 4 phases:
+
+* Determine the order in which the `BasicBlocks` will be allocated, and which predecessor of each block will be used to determine the starting location for variables live-in to the `BasicBlock`.
+* Construct Intervals for each tracked lclVar, then walk the `BasicBlocks` in the determined order building `RefPositions` for each register use, def, or kill.
+* Allocate the registers by traversing the `RefPositions`.
+* Write back the register assignments, and perform any necessary moves at block boundaries where the allocations don’t match.
+
+Post-conditions:
+
+* The `gtRegNum` property of all `GenTree` nodes that require a register has been set to a valid register number.
+* The `gtRsvdRegs` field (a set/mask of registers) has the requested number of registers specified for internal use.
+* All spilled values (lclVar or expression) are marked with `GTF_SPILL` at their definition. For lclVars, they are also marked with `GTF_SPILLED` at any use at which the value must be reloaded.
+* For all lclVars that are register candidates:
+ * `lvRegNum` = initial register location (or `REG_STK`)
+ * `lvRegister` flag set if it always lives in the same register
+ * `lvSpilled` flag is set if it is ever spilled
+* The maximum number of simultaneously-live spill locations of each type (used for spilling expression trees) has been communicated via calls to `compiler->tmpPreAllocateTemps(type)`.
+
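+For intuition, the following is a highly simplified linear scan skeleton (ignoring `RefPositions`, internal registers, spill-candidate selection, and register preferences) that illustrates the core idea of allocating over intervals sorted by start location:
+
+```cpp
+#include <algorithm>
+#include <vector>
+
+struct Interval { int start; int end; int reg = -1; };   // reg == -1 means spilled
+
+// Assign one of numRegs registers to each interval, leaving reg == -1
+// when none is free. Intervals must be sorted by increasing start.
+void LinearScan(std::vector<Interval>& intervals, int numRegs)
+{
+    std::vector<Interval*> active;   // intervals currently occupying a register
+    for (Interval& cur : intervals)
+    {
+        // Expire intervals that end before cur starts, freeing their registers.
+        active.erase(std::remove_if(active.begin(), active.end(),
+            [&](Interval* a) { return a->end < cur.start; }), active.end());
+
+        if ((int)active.size() < numRegs)
+        {
+            std::vector<bool> used(numRegs, false);
+            for (Interval* a : active)
+                used[a->reg] = true;
+            for (int r = 0; r < numRegs; r++)
+                if (!used[r]) { cur.reg = r; break; }
+            active.push_back(&cur);
+        }
+        // else: cur is spilled (a real allocator would consider evicting
+        // an active interval with a farther end point instead).
+    }
+}
+```
+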
+## <a name="code-generation"/>Code Generation
+
+The process of code generation is relatively straightforward, as Lowering has done some of the work already. Code generation proceeds roughly as follows:
+
+* Determine the frame layout – allocating space on the frame for any lclVars that are not fully enregistered, as well as any spill temps required for spilling non-lclVar expressions.
+* For each `BasicBlock`, in layout order, and each `GenTree` node in the block, in execution order:
+ * If the node is “contained” (i.e. its operation is subsumed by a parent node), do nothing.
+ * Otherwise, “consume” all the register operands of the node.
+    * This updates the liveness information (i.e. marking a lclVar as dead if this is the last use), and performs any needed copies.
+    * This must be done in correct execution order, obeying any reverse flags (`GTF_REVERSE_OPS`) on the operands, so that register conflicts are handled properly.
+ * Track the live variables in registers, as well as the live stack variables that contain GC refs.
+ * Produce the `instrDesc(s)` for the operation, with the current live GC references.
+ * Update the scope information (debug info) at block boundaries.
+* Generate the prolog and epilog code.
+* Write the final instruction bytes. It does this by invoking the emitter, which holds all the `instrDescs`.
+
+# Phase-dependent Properties and Invariants of the IR
+
+There are several properties of the IR that are valid only during (or after) specific phases of the JIT. This section describes the phase transitions, and how the IR properties are affected.
+
+## Phase Transitions
+
+* Flowgraph analysis
+ * Sets the predecessors of each block, which must be kept valid after this phase.
+ * Computes reachability and dominators. These may be invalidated by changes to the flowgraph.
+ * Computes edge weights, if profile information is available.
+ * Identifies and normalizes loops. These may be invalidated, but must be marked as such.
+* Normalization
+  * The lclVar reference counts are set by `lvaMarkLocalVars()`.
+ * Statement ordering is determined by `fgSetBlockOrder()`. Execution order is a depth-first preorder traversal of the nodes, with the operands usually executed in order. The exceptions are:
+ * Commutative operators, which can have the `GTF_REVERSE_OPS` flag set to indicate that op2 should be evaluated before op1.
+ * Assignments, which can also have the `GTF_REVERSE_OPS` flag set to indicate that the rhs (op2) should be evaluated before the target address (if any) on the lhs (op1) is evaluated. This can only be done if there are no side-effects in the expression for the lhs.
+* Rationalization
+ * All `GT_COMMA` nodes are split into separate statements, which may be embedded in other statements in execution order.
+ * All `GT_ASG` trees are transformed into `GT_STORE` variants (e.g. `GT_STORE_LCL_VAR`).
+ * All `GT_ADDR` nodes are eliminated (e.g. with `GT_LCL_VAR_ADDR`).
+* Lowering
+  * `GenTree` nodes are split or transformed as needed to expose all of their register requirements and any necessary flowgraph changes (e.g., for switch statements).
+
+## GenTree phase-dependent properties
+
+Ordering:
+
+* For `GenTreeStmt` nodes, the `gtNext` and `gtPrev` fields must always be consistent. The last statement in the `BasicBlock` must have `gtNext` equal to null. By convention, the `gtPrev` of the first statement in the `BasicBlock` must be the last statement of the `BasicBlock`.
+ * In all phases, `gtStmtExpr` points to the top-level node of the expression.
+* For non-statement nodes, the `gtNext` and `gtPrev` fields are either null, prior to ordering, or they are consistent (i.e. `A->gtPrev->gtNext == A`, and `A->gtNext->gtPrev == A`, if they are non-null).
+* After normalization the `gtStmtList` of the containing statement points to the first node to be executed.
+* Prior to normalization, the `gtNext` and `gtPrev` pointers on the expression (non-statement) `GenTree` nodes are invalid. The expression nodes are only traversed via the links from parent to child (e.g. `node->gtGetOp1()`, or `node->gtOp.gtOp1`). The `gtNext/gtPrev` links are set by `fgSetBlockOrder()`.
+ * After normalization, and prior to rationalization, the parent/child links remain the primary traversal mechanism. The evaluation order of any nested expression-statements (usually assignments) is enforced by the `GT_COMMA` in which they are contained.
+* After rationalization, all `GT_COMMA` nodes are eliminated, and the primary traversal mechanism becomes the `gtNext/gtPrev` links. Statements may be embedded within other statements, but the nodes of each statement preserve the valid traversal order.
+* In tree ordering:
+ * The `gtPrev` of the first node (`gtStmtList`) is always null.
+ * The `gtNext` of the last node (`gtStmtExpr`) is always null.
+* In linear ordering:
+ * The nodes of each statement are ordered such that `gtStmtList` is encountered first, and `gtStmtExpr` is encountered last.
+ * The nodes of an embedded statement S2 (starting with `S2->gtStmtList`) appear in the ordering after a node from the “containing” statement S1, and no other node from S1 will appear in the list prior to the `gtStmtExpr` of S2. However, there may be multiple levels of nesting of embedded statements.
+
+TreeNodeInfo:
+
+* The `TreeNodeInfo` (`gtLsraInfo`) is set during the Lowering phase, and communicates the register requirements of the node, including the number and types of registers used as sources, destinations and internal registers. Currently only a single destination per node is supported.
+
+## LclVar phase-dependent properties
+
+Prior to normalization, the reference counts (`lvRefCnt` and `lvRefCntWtd`) are not valid. After normalization they must be updated when lclVar references are added or removed.
+
+# Supporting technologies and components
+
+## Instruction encoding
+
+Instruction encoding is performed by the emitter ([emit.h](https://github.com/dotnet/coreclr/blob/master/src/jit/emit.h)), using the `insGroup`/`instrDesc` representation. The code generator calls methods on the emitter to construct `instrDescs`. The encoding information is captured in the following:
+
+* The “instruction” enumeration itemizes the different instructions available on each target, and is used as an index into the various encoding tables (e.g. `instInfo[]`, `emitInsModeFmtTab[]`) generated from the `instrs{tgt}.h` (e.g., [instrsxarch.h](https://github.com/dotnet/coreclr/blob/master/src/jit/instrsxarch.h)).
+* The skeleton encodings are contained in the tables, and then there are methods on the emitter that handle the special encoding constraints for the various instructions, addressing modes, register types, etc.
+
+## GC Info
+
+Reporting of live GC references is done in two ways:
+
+* For stack locations that are not tracked (these could be spill locations or lclVars – local variables or temps – that are not register candidates), they are initialized to null in the prolog, and reported as live for the entire method.
+* For lclVars with tracked lifetimes, or for expressions involving GC references, we report the range over which the reference is live. This is done by the emitter, which adds this information to the instruction group, and which terminates instruction groups when the GC info changes.
+
+The tracking of GC reference lifetimes is done via the `GCInfo` class in the JIT. It is declared in [src/jit/jitgcinfo.h](https://github.com/dotnet/coreclr/blob/master/src/jit/jitgcinfo.h) (to differentiate it from [src/inc/gcinfo.h](https://github.com/dotnet/coreclr/blob/master/src/inc/gcinfo.h)), and implemented in [src/jit/gcinfo.cpp](https://github.com/dotnet/coreclr/blob/master/src/jit/gcinfo.cpp).
+
+In a JitDump, the generated GC info can be seen following the “In gcInfoBlockHdrSave()” line.
+
+## Debugger info
+
+Debug info consists primarily of two types of information in the JIT:
+
+* Mapping of IL offsets to native code offsets. This is accomplished via:
+ * the `gtStmtILoffsx` on the statement nodes (`GenTreeStmt`)
+  * the `gtLclILoffs` on lclVar references (`GenTreeLclVar`)
+ * The IL offsets are captured during CodeGen by calling `CodeGen::genIPmappingAdd()`, and then written to debug tables by `CodeGen::genIPmappingGen()`.
+* Mapping of user locals to location (register or stack). This is accomplished via:
+ * Struct `siVarLoc` (in [compiler.h](https://github.com/dotnet/coreclr/blob/master/src/jit/compiler.h)) captures the location
+ * `VarScopeDsc` ([compiler.h](https://github.com/dotnet/coreclr/blob/master/src/jit/compiler.h)) captures the live range of a local variable in a given location.
+
+## Exception handling
+
+Exception handling information is captured in an `EHblkDsc` for each exception handling region. Each region includes the first and last blocks of the try and handler regions, exception type, enclosing region, among other things. Look at [jiteh.h](https://github.com/dotnet/coreclr/blob/master/src/jit/jiteh.h) and [jiteh.cpp](https://github.com/dotnet/coreclr/blob/master/src/jit/jiteh.cpp), especially, for details. Look at `Compiler::fgVerifyHandlerTab()` to see how the exception table constraints are verified.
+
+# Reading a JitDump
+
+One of the best ways of learning about the JIT compiler is examining a compilation dump in detail. The dump shows you all the really important details of the basic data structures without all the implementation detail of the code. Debugging a JIT bug almost always begins with a JitDump. Only after the problem is isolated by the dump does it make sense to start debugging the JIT code itself.
+
+Dumps are also useful because they give you good places to place breakpoints. If you want to see what is happening at some point in the dump, simply search for the dump text in the source code. This gives you a great place to put a conditional breakpoint.
+
+There is not a strong convention about what or how the information is dumped, but generally you can find phase-specific information by searching for the phase name. Some useful points follow.
+
+## How to create a JitDump
+
+You can enable dumps by setting the `COMPlus_JitDump` environment variable to a space-separated list of the method(s) you want to dump. For example:
+
+```cmd
+:: Print out lots of useful info when
+:: compiling methods named Main/GetEnumerator
+set "COMPlus_JitDump=Main GetEnumerator"
+```
+
+See [Setting configuration variables](../building/viewing-jit-dumps.md#setting-configuration-variables) for more details on this.
+
+Full instructions for dumping the compilation of some managed code can be found here: [viewing-jit-dumps.md](../building/viewing-jit-dumps.md)
+
+## Reading expression trees
+
+It takes some time to learn to “read” the expression trees, which are printed with the children indented from the parent, and, for binary operators, with the first operand below the parent and the second operand above.
+
+Here is an example dump:
+
+ [000027] ------------ ▌ stmtExpr void (top level) (IL 0x010... ???)
+ [000026] --C-G------- └──▌ return double
+ [000024] --C-G------- └──▌ call double BringUpTest.DblSqrt
+ [000021] ------------ │ ┌──▌ lclVar double V02 arg2
+ [000022] ------------ │ ┌──▌ - double
+ [000020] ------------ │ │ └──▌ lclVar double V03 loc0
+ [000023] ------------ arg0 └──▌ * double
+ [000017] ------------ │ ┌──▌ lclVar double V01 arg1
+ [000018] ------------ │ ┌──▌ - double
+ [000016] ------------ │ │ └──▌ lclVar double V03 loc0
+ [000019] ------------ └──▌ * double
+ [000013] ------------ │ ┌──▌ lclVar double V00 arg0
+ [000014] ------------ │ ┌──▌ - double
+ [000012] ------------ │ │ └──▌ lclVar double V03 loc0
+ [000015] ------------ └──▌ * double
+ [000011] ------------ └──▌ lclVar double V03 loc0
+
+The tree nodes are indented to represent the parent-child relationship. Binary operators print first the right hand side, then the operator node itself, then the left hand side. This scheme makes sense if you look at the dump “sideways” (lean your head to the left). Oriented this way, the left hand side operand is actually on the left side, and the right hand side operand is on the right side, so you can almost visualize the tree if you look at it sideways. The indentation level is also there as a backup.
+
+Tree nodes are identified by their `gtTreeID`. This field only exists in DEBUG builds, but is quite useful for debugging, since all tree nodes are created from the routine `gtNewNode` (in [src/jit/gentree.cpp](https://github.com/dotnet/coreclr/blob/master/src/jit/gentree.cpp)). If you find a bad tree and wish to understand how it got corrupted, you can place a conditional breakpoint at the end of `gtNewNode` to see when it is created, and then a data breakpoint on the field that you believe is corrupted.
+
+The trees are connected by line characters (either in ASCII, by default, or in slightly more readable Unicode when `COMPlus_JitDumpAscii=0` is specified), to make it a bit easier to read.
+
+ N037 ( 0, 0) [000391] ----------L- arg0 SETUP │ ┌──▌ argPlace ref REG NA $1c1
+ N041 ( 2, 8) [000389] ------------ │ │ ┌──▌ const(h) long 0xB410A098 REG rcx $240
+ N043 ( 4, 10) [000390] ----G------- │ │ ┌──▌ indir ref REG rcx $1c1
+ N045 ( 4, 10) [000488] ----G------- arg0 in rcx │ ├──▌ putarg_reg ref REG rcx
+ N049 ( 18, 16) [000269] --C-G------- └──▌ call void System.Diagnostics.TraceInternal.Fail $VN.Void
+
+## Variable naming
+
+The dump uses the index into the local variable table as its name. The arguments to the function come first, then the local variables, then any compiler generated temps. Thus in a function with 2 parameters (remember “this” is also a parameter), and one local variable, the first argument would be variable 0, the second argument variable 1, and the local variable would be variable 2. As described earlier, tracked variables are given a tracked variable index which identifies the bit for that variable in the dataflow bit vectors. This can lead to confusion as to whether the variable number is its index into the local variable table, or its tracked index. In the dumps when we refer to a variable by its local variable table index we use the ‘V’ prefix, and when we print the tracked index we prefix it by a ‘T’.
+
+## References
+
+<a name="[1]"/>
+[1] P. Briggs, K. D. Cooper, T. J. Harvey, and L. T. Simpson, "Practical improvements to the construction and destruction of static single assignment form," Software – Practice and Experience, vol. 28, no. 8, pp. 859-881, Jul. 1998.
+
+<a name="[2]"/>
+[2] Wimmer, C. and Mössenböck, H. "Optimized Interval Splitting in a Linear Scan Register Allocator," ACM VEE 2005, pp. 132-141. [http://portal.acm.org/citation.cfm?id=1064998](http://portal.acm.org/citation.cfm?id=1064998)
diff --git a/Documentation/botr/stackwalking.md b/Documentation/botr/stackwalking.md
new file mode 100644
index 0000000000..a976aa8c5b
--- /dev/null
+++ b/Documentation/botr/stackwalking.md
@@ -0,0 +1,85 @@
+Stackwalking in the CLR
+===
+
+Author: Rudi Martin ([@Rudi-Martin](https://github.com/Rudi-Martin)) - 2008
+
+The CLR makes heavy use of a technique known as stack walking (or stack crawling). This involves iterating the sequence of call frames for a particular thread, from the most recent (the thread's current function) back down to the base of the stack.
+
+The runtime uses stack walks for a number of purposes:
+
+- The runtime walks the stacks of all threads during garbage collection, looking for managed roots (local variables holding object references in the frames of managed methods that need to be reported to the GC to keep the objects alive and possibly track their movement if the GC decides to compact the heap).
+- On some platforms the stack walker is used during the processing of exceptions (looking for handlers in the first pass and unwinding the stack in the second).
+- The debugger uses the functionality when generating managed stack traces.
+- Various miscellaneous methods, usually those close to some public managed API, perform a stack walk to pick up information about their caller (such as the method, class or assembly of that caller).
+
+# The Stack Model
+
+Here we define some common terms and describe the typical layout of a thread's stack.
+
+Logically, a stack is divided up into some number of _frames_. Each frame represents some function (managed or unmanaged) that is either currently executing or has called into some other function and is waiting for it to return. A frame contains state required by the specific invocation of its associated function. Typically this includes space for local variables, pushed arguments for a call to another function, saved caller registers etc.
+
+The exact definition of a frame varies from platform to platform and on many platforms there isn't a hard definition of a frame format that all functions adhere to (x86 is an example of this). Instead the compiler is often free to optimize the exact format of frames. On such systems it is not possible to guarantee that a stackwalk will return 100% correct or complete results (for debugging purposes, debug symbols such as pdbs are used to fill in the gaps so that debuggers can generate more accurate stack traces).
+
+This is not a problem for the CLR, however, since we do not require a fully generalized stack walk. Instead we are only interested in those frames that are managed (i.e. represent a managed method) or, to some extent, frames coming from unmanaged code used to implement part of the runtime itself. In particular there is no guarantee about fidelity of 3rd party unmanaged frames other than to note where such frames transition into or out of the runtime itself (i.e. one of the frame types we do care about).
+
+Because we control the format of the frames we're interested in (we'll delve into the details of this later) we can ensure that those frames are crawlable with 100% fidelity. The only additional requirement is a mechanism to link disjoint groups of runtime frames together such that we can skip over any intervening unmanaged (and otherwise uncrawlable) frames.
+
+The following diagram illustrates a stack containing all the frame types (note that this document uses a convention where stacks grow towards the top of the page):
+
+![image](../images/stack.png)
+
+# Making Frames Crawlable
+
+## Managed Frames
+
+Because the runtime owns and controls the JIT (Just-in-Time compiler) it can arrange for managed methods to always leave a crawlable frame. One solution here would be to utilize a rigid frame format for all methods (e.g. the x86 EBP frame format). In practice, however, this can be inefficient, especially for small leaf methods (such as typical property accessors).
+
+Since methods are typically called more times than their frames are crawled (stack crawls are relatively rare in the runtime, at least with respect to the rate at which methods are typically called) it makes sense to trade method call performance for some additional crawl time processing. As a result the JIT generates additional metadata for each method it compiles that includes sufficient information for the stack crawler to decode a stack frame belonging to that method.
+
+This metadata can be found via a hash-table lookup with an instruction pointer somewhere within the method as the key. The JIT utilizes compression techniques in order to minimize the impact of this additional per-method metadata.
+
+Given initial values for a few important registers (e.g. EIP, ESP and EBP on x86 based systems) the stack crawler can locate a managed method and its associated JIT metadata and use this information to roll back the register values to those current in the method's caller. In this fashion a sequence of managed method frames can be traversed from the most recent to the oldest caller. This operation is sometimes referred to as a _virtual unwind_ (virtual because we're not actually updating the real values of ESP etc., leaving the stack intact).
+
+## Runtime Unmanaged Frames
+
+The runtime is partially implemented in unmanaged code (e.g. coreclr.dll). Most of this code is special in that it operates as _manually managed_ code. That is, it obeys many of the rules and protocols of managed code but in an explicitly controlled fashion. For instance such code can explicitly enable or disable GC pre-emptive mode and needs to manage its use of object references accordingly.
+
+Another area where this careful interaction with managed code comes into play is during stackwalks. Since the majority of the runtime's unmanaged code is written in C++ we don't have the same control over method frame format as managed code. At the same time there are many instances where runtime unmanaged frames contain information that is important during a stack walk. These include cases where unmanaged functions hold object references in local variables (which must be reported during garbage collections) and exception processing.
+
+Rather than attempt to make each unmanaged frame crawlable, unmanaged functions with interesting data to report to stack crawls bundle up the information into a data structure called a Frame. The choice of name is unfortunate as it can lead to ambiguity in stack related discussions. This document will always refer to the data structure variant as a capitalized Frame.
+
+Frame is actually the abstract base class of an entire hierarchy of Frame types. Frame is sub-typed in order to express different types of information that might be interesting to a stack walk.
+
+But how does the stack walker find these Frames and how do they relate to the frames utilized by managed methods?
+
+Each Frame is part of a singly linked list, having a next pointer to the next oldest Frame on this thread's stack (or null if the Frame is the oldest). The CLR Thread structure holds a pointer to the newest Frame. Unmanaged runtime code can push or pop Frames as needed by manipulating the Thread structure and Frame list.
+
+In this fashion the stack walker can iterate unmanaged Frames in newest to oldest order (the same order in which managed frames are iterated). But managed and unmanaged methods can be interleaved, and it would be wrong to process all managed frames followed by unmanaged Frames or vice versa since that would not accurately represent the real calling sequence.
+
+To solve this problem Frames are further restricted in that they must be allocated on the stack in the frame of the method that pushes them onto the Frame list. Since the stack walker knows the stack bounds of each managed frame it can perform simple pointer comparisons to determine whether a given Frame is older or newer than a given managed frame.
+
+Essentially the stack walker, having decoded the current frame, always has two possible choices for the next (older) frame: the next managed frame determined via a virtual unwind of the register set or the next oldest Frame on the Thread's Frame list. It can decide which is appropriate by determining which occupies stack space nearer the stack top. The actual calculation involved is platform dependent but usually devolves to one or two pointer comparisons.
+
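+The following sketch shows the shape of that decision. All of the types and helpers here are simplified, illustrative stand-ins for the real runtime structures, and it assumes a downward-growing stack where newer frames live at lower addresses:
+
+```cpp
+// Illustrative model only: an explicit Frame list plus a virtual-unwind
+// cursor for managed frames.
+struct Frame { Frame* m_next; };   // next older Frame on this thread
+
+// Hypothetical helpers standing in for the real machinery.
+void  ProcessFrame(Frame* f);          // report an explicit Frame
+void  ProcessManagedFrame(void* sp);   // report a managed frame
+void* VirtualUnwind(void* sp);         // caller's SP, or nullptr at the base
+
+void WalkStack(Frame* pFrame, void* managedSP)
+{
+    while (pFrame != nullptr || managedSP != nullptr)
+    {
+        // Whichever candidate occupies stack space nearer the stack top
+        // (the lower address, on a downward-growing stack) is newer.
+        if (pFrame != nullptr &&
+            (managedSP == nullptr || (void*)pFrame < managedSP))
+        {
+            ProcessFrame(pFrame);
+            pFrame = pFrame->m_next;
+        }
+        else
+        {
+            ProcessManagedFrame(managedSP);
+            managedSP = VirtualUnwind(managedSP);   // roll back the register set
+        }
+    }
+}
+```
+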
+When managed code calls into the unmanaged runtime one of several forms of transition Frame is often pushed by the unmanaged target method. This is needed both to record the register state of the calling managed method (so that the stack walker can resume virtual unwinding of managed frames once it has finished enumerating the unmanaged Frames) and in many cases because managed object references are passed as arguments to the unmanaged method and must be reported to the GC in the event of a garbage collection.
+
+A full description of the available Frame types and their uses is beyond the scope of the document. Further details can be found in the [frames.h](https://github.com/dotnet/coreclr/blob/master/src/vm/frames.h) header file.
+
+# Stackwalker Interface
+
+The full stack walk interface is exposed to runtime unmanaged code only (a simplified subset is available to managed code via the System.Diagnostics.StackTrace class). The typical entrypoint is via the StackWalkFramesEx() method on the runtime Thread class.
+
+The caller of this method provides three main inputs:
+
+1. Some context indicating the starting point of the walk. This is either an initial register set (for instance if you've suspended the target thread and can call GetThreadContext() on it) or an initial Frame (in cases where you know the code in question is in runtime unmanaged code). Although most stack walks are made from the top of the stack it's possible to start lower down if you can determine the correct starting context.
+2. A function pointer and associated context. The function provided is called by the stack walker for each interesting frame (in order from the newest to the oldest). The context value provided is passed to each invocation of the callback so that it can record or build up state during the walk.
+3. Flags indicating what sort of frames should trigger a callback. This allows the caller to specify that only pure managed method frames should be reported for instance. For a full list see [threads.h](https://github.com/dotnet/coreclr/blob/master/src/vm/threads.h) (just above the declaration of StackWalkFramesEx()).
+
+StackWalkFramesEx() returns an enum value that indicates whether the walk terminated normally (got to the stack base and ran out of methods to report), was aborted by one of the callbacks (the callbacks return an enum of the same type to the stack walk to control this) or suffered some other miscellaneous error.
+
+Aside from the context value passed to StackWalkFramesEx(), stack callback functions are passed one other piece of context: the CrawlFrame. This class is defined in [stackwalk.h](https://github.com/dotnet/coreclr/blob/master/src/vm/stackwalk.h) and contains all sorts of context gathered as the stack walk proceeds.
+
+For instance the CrawlFrame indicates the MethodDesc* for managed frames and the Frame* for unmanaged Frames. It also provides the current register set inferred by virtually unwinding frames up to that point.
+
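+To illustrate the overall pattern (using hypothetical, heavily simplified signatures rather than the real declarations in threads.h), a walk that counts managed frames might look like this:
+
+```cpp
+// Hypothetical, simplified model of the callback-based interface.
+enum StackWalkAction { SWA_CONTINUE, SWA_ABORT };
+
+struct CrawlFrame;                              // opaque per-frame context
+bool IsManagedFrame(const CrawlFrame* pCF);     // stand-in query
+
+typedef StackWalkAction (*StackCallback)(CrawlFrame* pCF, void* context);
+
+// Stand-in for the runtime's entrypoint (Thread::StackWalkFramesEx).
+StackWalkAction StackWalkFrames(StackCallback cb, void* context);
+
+// The callback is invoked once per interesting frame, newest to oldest,
+// and accumulates state through the caller-supplied context pointer.
+StackWalkAction CountManagedFrames(CrawlFrame* pCF, void* context)
+{
+    if (IsManagedFrame(pCF))
+        ++*static_cast<int*>(context);
+    return SWA_CONTINUE;    // returning SWA_ABORT would stop the walk
+}
+
+int GetManagedFrameCount()
+{
+    int count = 0;
+    StackWalkFrames(CountManagedFrames, &count);
+    return count;
+}
+```
+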
+# Stackwalk Implementation Details
+
+Further low-level details of the stack walk implementation are currently outside the scope of this document. If you have knowledge of these and would care to share that knowledge please feel free to update this document.
diff --git a/Documentation/botr/threading.md b/Documentation/botr/threading.md
new file mode 100644
index 0000000000..2e13d52df3
--- /dev/null
+++ b/Documentation/botr/threading.md
@@ -0,0 +1,210 @@
+CLR Threading Overview
+======================
+
+Managed vs. Native Threads
+==========================
+
+Managed code executes on "managed threads," which are distinct from the native threads provided by the operating system. A native thread is a thread of execution of native code on a physical machine; a managed thread is a virtual thread of execution on the CLR's virtual machine.
+
+Just as the JIT compiler maps "virtual" IL instructions into native instructions that execute on the physical machine, the CLR's threading infrastructure maps "virtual" managed threads onto the native threads provided by the operating system.
+
+At any given time, a managed thread may or may not be assigned to a native thread for execution. For example, a managed thread that has been created (via "new System.Threading.Thread") but not yet started (via System.Threading.Thread.Start) is a managed thread that has not yet been assigned to a native thread. Similarly, a managed thread may, in principle, move between multiple native threads over the course of its execution, though in practice the CLR does not currently support this.
+
+The public Thread interface available to managed code intentionally hides the details of the underlying native threads, because:
+
+- Managed threads are not necessarily mapped to a single native thread (and may not be mapped to a native thread at all).
+- Different operating systems expose different abstractions for native threads.
+- In principle, managed threads are "virtualized".
+
+The CLR provides equivalent abstractions for managed threads, implemented by the CLR itself. For example, it does not expose the operating system's thread-local storage (TLS) mechanism, but instead provides managed "thread-static" variables. Similarly, it does not expose the native thread's "thread ID," but instead provides a "managed thread ID" which is generated independently of the OS. However, for diagnostic purposes, some details of the underlying native thread may be obtained via types in the System.Diagnostics namespace.
+
+Managed threads require additional functionality typically not needed by native threads. First, managed threads hold GC references on their stacks, so the CLR must be able to enumerate (and possibly modify) these references every time a GC occurs. To do this, the CLR must "suspend" each managed thread (stop it at a point where all of its GC references can be found). Second, when an AppDomain is unloaded, the CLR must ensure that no thread is executing code in that AppDomain. This requires the ability to force a thread to unwind out of that AppDomain. The CLR does this by injecting a ThreadAbortException into such threads.
+
+Data Structures
+===============
+
+Every managed thread has an associated Thread object, defined in [threads.h][threads.h]. This object tracks everything the VM needs to know about the managed thread. This includes things that are _necessary_, such as the thread's current GC mode and Frame chain, as well as many things that are allocated per-thread simply for performance reasons (such as some fast arena-style allocators).
+
+All Thread objects are stored in the ThreadStore (also defined in [threads.h][threads.h]), which is a simple list of all known Thread objects. To enumerate all managed threads, one must first acquire the ThreadStoreLock, then use ThreadStore::GetAllThreadList to enumerate all Thread objects. This list may include managed threads which are not currently assigned to native threads (for example, they may not yet be started, or the native thread may already have exited).
+
+[threads.h]: https://github.com/dotnet/coreclr/blob/master/src/vm/threads.h
+
+Each managed thread that is currently assigned to a native thread is reachable via a native thread-local storage (TLS) slot on that native thread. This allows code that is executing on that native thread to get the corresponding Thread object, via GetThread().
+
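+Conceptually, that lookup amounts to a thread-local pointer, as in the sketch below (a simplification; the actual implementation manages the TLS slot explicitly and varies by platform):
+
+```cpp
+class Thread;   // the native (VM) Thread object
+
+// One slot per native thread; set when a managed thread is assigned to
+// this native thread, and cleared when that association ends.
+thread_local Thread* t_pCurrentThread = nullptr;
+
+// Returns the Thread associated with the current native thread, or
+// nullptr if this native thread has never run managed code.
+Thread* GetThread()
+{
+    return t_pCurrentThread;
+}
+```
+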
+Additionally, many managed threads have a _managed_ Thread object (System.Threading.Thread) which is distinct from the native Thread object. The managed Thread object provides methods for managed code to interact with the thread, and is mostly a wrapper around functionality offered by the native Thread object. The current managed Thread object is reachable (from managed code) via Thread.CurrentThread.
+
+In a debugger, the SOS extension command "!Threads" can be used to enumerate all Thread objects in the ThreadStore.
+
+Thread Lifetimes
+================
+
+A managed thread is created in the following situations:
+
+1. Managed code explicitly asks the CLR to create a new thread via System.Threading.Thread.
+2. The CLR creates the managed thread directly (see "special threads" below).
+3. Native code calls managed code on a native thread which is not yet associated with a managed thread (via "reverse p/invoke" or COM interop).
+4. A managed process starts (invoking its Main method on the process' Main thread).
+
+In cases #1 and #2, the CLR is responsible for creating a native thread to back the managed thread. This is not done until the thread is actually _started_. In such cases, the native thread is "owned" by the CLR; the CLR is responsible for the native thread's lifetime. In these cases, the CLR is aware of the existence of the thread by virtue of the fact that the CLR created it in the first place.
+
+In cases #3 and #4, the native thread already existed prior to the creation of the managed thread, and is owned by code external to the CLR. The CLR is not responsible for the native thread's lifetime. The CLR becomes aware of these threads the first time they attempt to call managed code.
+
+When a native thread dies, the CLR is notified via its DllMain function. This happens inside of the OS "loader lock," so there is little that can be done (safely) while processing this notification. So rather than destroying the data structures associated with the managed thread, the thread is simply marked as "dead" and the finalizer thread is signaled to run. The finalizer thread then sweeps through the threads in the ThreadStore and destroys any that are both dead _and_ unreachable via managed code.
+
+Suspension
+==========
+
+The CLR must be able to find all references to managed objects in order to perform a GC. Managed code is constantly accessing the GC heap, and manipulating references stored on the stack and in registers. The CLR must ensure that all managed threads are stopped (so they aren't modifying the heap) to safely and reliably find all managed objects. Threads are only stopped at _safe points_, where registers and stack locations can be inspected for live references.
+
+Another way of putting this is that the GC heap, and every thread's stack and register state, is "shared state," accessed by multiple threads. As with most shared state, some sort of "lock" is required to protect it. Managed code must hold this lock while accessing the heap, and can only release the lock at safe points.
+
+The CLR refers to this "lock" as the thread's "GC mode." A thread which is in "cooperative mode" holds its lock; it must "cooperate" with the GC (by releasing the lock) in order for a GC to proceed. A thread which is in "preemptive" mode does not hold its lock – the GC may proceed "preemptively" because the thread is known to not be accessing the GC heap.
+
+A GC may only proceed when all managed threads are in "preemptive" mode (not holding the lock). The process of moving all managed threads to preemptive mode is known as "GC suspension" or "suspending the Execution Engine (EE)."
+
+A naïve implementation of this "lock" would be for each managed thread to actually acquire and release a real lock around each access to the GC heap. Then the GC would simply attempt to acquire the lock on each thread; once it had acquired all threads' locks, it would be safe to perform the GC.
+
+However, this naïve approach is unsatisfactory for two reasons. First, it would require managed code to spend a lot of time acquiring and releasing the lock (or at least checking whether the GC was attempting to acquire the lock – known as "GC polling.") Second, it would require the JIT to emit "GC info" describing the layout of the stack and registers for every point in JIT'd code; this information would consume large amounts of memory.
+
+We refined this naïve approach by separating JIT'd managed code into "partially interruptible" and "fully interruptible" code. In partially interruptible code, the only safe points are calls to other methods, and explicit "GC poll" locations where the JIT emits code to check whether a GC is pending. GC info need only be emitted for these locations. In fully interruptible code, every instruction is a safe point, and the JIT emits GC info for every instruction – but it does not emit GC polls. Instead, fully interruptible code may be "interrupted" by hijacking the thread (a process which is discussed later in this document). The JIT chooses whether to emit fully- or partially-interruptible code based on heuristics to find the best tradeoff between code quality, size of the GC info, and GC suspension latency.
+
+Given the above, there are three fundamental operations to define: entering cooperative mode, leaving cooperative mode, and suspending the EE.
+
+Entering Cooperative Mode
+-------------------------
+
+A thread enters cooperative mode by calling Thread::DisablePreemptiveGC. This acquires the "lock" for the current thread, as follows:
+
+1. If a GC is in progress (the GC holds the lock) then block until the GC is complete.
+2. Mark the thread as being in cooperative mode. No GC may proceed until the thread reenters preemptive mode.
+
+These two steps proceed as if they were atomic.
+
+Entering Preemptive Mode
+------------------------
+
+A thread enters preemptive mode (releases the lock) by calling Thread::EnablePreemptiveGC. This simply marks the thread as no longer being in cooperative mode, and informs the GC thread that it may be able to proceed.
+
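+A minimal model of these two operations is sketched below. It is illustrative only; the real Thread::DisablePreemptiveGC and Thread::EnablePreemptiveGC are considerably more involved:
+
+```cpp
+#include <atomic>
+#include <condition_variable>
+#include <mutex>
+
+std::atomic<bool>       g_gcInProgress{false};
+std::mutex              g_suspendLock;
+std::condition_variable g_gcComplete;
+
+struct ManagedThread { std::atomic<bool> cooperative{false}; };
+
+// Enter cooperative mode: block while a GC is running, then take "the lock".
+void DisablePreemptiveGC(ManagedThread& t)
+{
+    std::unique_lock<std::mutex> lock(g_suspendLock);
+    g_gcComplete.wait(lock, [] { return !g_gcInProgress.load(); });
+    t.cooperative = true;   // no GC may proceed until this thread leaves
+}                           // cooperative mode
+
+// Enter preemptive mode: release "the lock" so a pending GC may proceed.
+void EnablePreemptiveGC(ManagedThread& t)
+{
+    t.cooperative = false;  // the GC thread waits on this state
+}
+```
+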
+Suspending the EE
+-----------------
+
+When a GC needs to occur, the first step is to suspend the EE. This is done by GCHeap::SuspendEE, which proceeds as follows:
+
+1. Set a global flag (`g_fTrapReturningThreads`) to indicate that a GC is in progress. Any threads that attempt to enter cooperative mode will block until the GC is complete.
+2. Find all threads currently executing in cooperative mode. For each such thread, attempt to hijack the thread and force it to leave cooperative mode.
+3. Repeat until no threads are running in cooperative mode.
+
+Hijacking
+---------
+
+Hijacking for GC suspension is done by Thread::SysSuspendForGC. This method attempts to force any managed thread that is currently running in cooperative mode to leave cooperative mode at a "safe point." It does this by enumerating all managed threads (walking the ThreadStore) and, for each managed thread currently running in cooperative mode:
+
+1. Suspend the underlying native thread. This is done with the Win32 SuspendThread API. This API forcibly stops the thread from running, at some random point in its execution (not necessarily a safe point).
+2. Get the current CONTEXT for the thread, via GetThreadContext. This is an OS concept; CONTEXT represents the current register state of the thread. This allows us to inspect its instruction pointer, and thus determine what type of code it is currently executing.
+3. Check again if the thread is in cooperative mode, as it may have already left cooperative mode before it could be suspended. If so, the thread is in dangerous territory: the thread may be executing arbitrary native code, and must be resumed immediately to avoid deadlocks.
+4. Check if the thread is running managed code. It is possible that it is executing native VM code in cooperative mode (see Synchronization, below), in which case the thread must be immediately resumed as in the previous step.
+5. Now the thread is suspended in managed code. Depending on whether that code is fully- or partially-interruptible, one of the following is performed:
+  * If fully interruptible, it is safe to perform a GC at any point, since the thread is, by definition, at a safe point. It would be reasonable to leave the thread suspended at this point (because it's safe), but various historical OS bugs prevent this from working (the CONTEXT retrieved earlier may be corrupt). Instead, the thread's instruction pointer is overwritten, redirecting it to a stub that will capture a more complete CONTEXT, leave cooperative mode, wait for the GC to complete, reenter cooperative mode, and restore the thread to its previous state.
+  * If partially interruptible, the thread is, by definition, not at a safe point. However, the caller will be at a safe point (method transition). Using that knowledge, the CLR "hijacks" the top-most stack frame's return address (physically overwriting that location on the stack) with a stub similar to the one used for fully-interruptible code. When the method returns, it will no longer return to its actual caller, but rather to the stub (the method may also perform a GC poll, inserted by the JIT, before that point, which will cause it to leave cooperative mode and undo the hijack).
+
+ThreadAbort / AppDomain-Unload
+==============================
+
+In order to unload an AppDomain, the CLR must ensure that no thread is running in that AppDomain. To accomplish this, the CLR enumerates all managed threads and "aborts" any thread which has stack frames belonging to the AppDomain being unloaded. A ThreadAbortException is "injected" into the running thread, which causes the thread to unwind (executing backout code along the way) until it is no longer executing in the AppDomain, at which point the ThreadAbortException is translated into an AppDomainUnloaded exception.
+
+ThreadAbortException is a special type of exception. It can be caught by user code, but the CLR ensures that the exception will be rethrown after the user's exception handler is executed. Thus ThreadAbortException is sometimes referred to as "uncatchable," though this is not strictly true.
+
+A ThreadAbortException is typically 'thrown' by simply setting a bit on the managed thread marking it as "aborting." This bit is checked by various parts of the CLR (most notably, every return from a p/invoke), and often setting this bit is all that is needed to get the thread aborted in a timely manner.
+
+However, if the thread is, for example, executing a long-running managed loop, it may never check this bit. To get such a thread to abort faster, the thread is "hijacked" and forced to raise a ThreadAbortException. This hijacking is done in the same way as GC suspension, except that the stubs that the thread is redirected to will cause a ThreadAbortException to be raised, rather than waiting for a GC to complete.
+
+This hijacking means that a ThreadAbortException can be raised at essentially any arbitrary point in managed code. This makes it extremely difficult for managed code to deal successfully with a ThreadAbortException. It is therefore unwise to use this mechanism for any purpose other than AppDomain-Unload, which ensures that any state corrupted by the ThreadAbort will be cleaned up along with the AppDomain.
+
+Synchronization: Managed
+========================
+
+Managed code has access to many synchronization primitives, collected within the System.Threading namespace. These include wrappers for native OS primitives like Mutex, Event, and Semaphore objects, as well as some abstractions such as Barriers and SpinLocks. However, the primary synchronization mechanism used by most managed code is System.Threading.Monitor, which provides a high-performance locking facility on _any managed object_, and additionally provides "condition variable" semantics for signaling changes in the state protected by a lock.
+
+Monitor is implemented as a "hybrid lock;" it has features of both a spin-lock and a kernel-based lock like a Mutex. The idea is that most locks are held only briefly, so it takes less time to simply spin-wait for the lock to be released, than it would to make a call into the kernel to block the thread. It is important not to waste CPU cycles spinning, so if the lock has not been acquired after a brief period of spinning, the implementation falls back to blocking in the kernel.
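+
+For example, the C# `lock` statement compiles down to Monitor.Enter and Monitor.Exit on the given object:
+
+    private static readonly object s_gate = new object();
+    private static int s_counter;
+
+    static void Increment()
+    {
+        lock (s_gate)      // Monitor.Enter(s_gate): spin briefly, then block
+        {
+            s_counter++;   // state protected by the hybrid lock
+        }                  // Monitor.Exit(s_gate), even if an exception is thrown
+    }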
+
+Because any object may potentially be used as a lock/condition variable, every object must have a location in which to store the lock information. This is done with "object headers" and "sync blocks."
+
+The object header is a machine-word-sized field that precedes every managed object. It is used for many purposes, such as storing the object's hash code. One such purpose is holding the object's lock state. If more per-object data is needed than will fit in the object header, we "inflate" the object by creating a "sync block."
+
+Sync blocks are stored in the Sync Block Table, and are addressed by sync block indexes. Each object with an associated sync block stores the index of that sync block in its object header.
+
+The details of object headers and sync blocks are defined in [syncblk.h][syncblk.h]/[.cpp][syncblk.cpp].
+
+[syncblk.h]: https://github.com/dotnet/coreclr/blob/master/src/vm/syncblk.h
+[syncblk.cpp]: https://github.com/dotnet/coreclr/blob/master/src/vm/syncblk.cpp
+
+If there is room on the object header, Monitor stores the managed thread ID of the thread that currently holds the lock on the object (or zero (0) if no thread holds the lock). Acquiring the lock in this case is a simple matter of spin-waiting until the object header's thread ID is zero, and then atomically setting it to the current thread's managed thread ID.
+
+If the lock cannot be acquired in this manner after some number of spins, or the object header is already being used for other purposes, a sync block must be created for the object. This has additional data, including an event that can be used to block the current thread, allowing us to stop spinning and efficiently wait for the lock to be released.
+
+An object that is used as a condition variable (via Monitor.Wait and Monitor.Pulse) must always be inflated, as there is not enough room in the object header to hold the required state.
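+
+For example, a classic producer/consumer pair built on these semantics (assumes `using System.Threading;` and `using System.Collections.Generic;`):
+
+    static readonly object s_gate = new object();
+    static readonly Queue<int> s_queue = new Queue<int>();
+
+    static void Produce(int item)
+    {
+        lock (s_gate)
+        {
+            s_queue.Enqueue(item);
+            Monitor.Pulse(s_gate);    // wake one waiting consumer
+        }
+    }
+
+    static int Consume()
+    {
+        lock (s_gate)
+        {
+            while (s_queue.Count == 0)
+                Monitor.Wait(s_gate); // releases the lock while waiting
+            return s_queue.Dequeue();
+        }
+    }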
+
+Synchronization: Native
+=======================
+
+The native portion of the CLR must also be aware of threading, as it will be invoked by managed code on multiple threads. This requires native synchronization mechanisms, such as locks, events, etc.
+
+The ITaskHost API allows a host to override many aspects of managed threading, including thread creation, destruction, and synchronization. The ability of a host to override native synchronization means that VM code can generally not use native synchronization primitives (Critical Sections, Mutexes, Events, etc.) directly, but rather must use the VM's wrappers over these.
+
+Additionally, as described above, GC suspension is a special kind of "lock" that affects nearly every aspect of the CLR. Native code in the VM may enter "cooperative" mode if it must manipulate GC heap objects, and thus the "GC suspension lock" becomes one of the most important synchronization mechanisms in native VM code, as well as in managed code.
+
+The major synchronization mechanisms used in native VM code are the GC mode, and Crst.
+
+GC Mode
+-------
+
+As discussed above, all managed code runs in cooperative mode, because it may manipulate the GC heap. Generally, native code does not touch managed objects, and thus runs in preemptive mode. But some native code in the VM must access the GC heap, and thus must run in cooperative mode.
+
+Native code generally does not manipulate the GC mode directly, but rather uses two macros: GCX\_COOP and GCX\_PREEMP. These enter the desired mode, and erect "holders" to cause the thread to revert to the previous mode when the scope is exited.
+
+It is important to understand that GCX\_COOP effectively acquires a lock on the GC heap. No GC may proceed while the thread is in cooperative mode. And native code cannot be "hijacked" as is done for managed code, so the thread will remain in cooperative mode until it explicitly switches back to preemptive mode.
+
+Thus entering cooperative mode in native code is discouraged. In cases where cooperative mode must be entered, it should be kept to as short a time as possible. The thread should not be blocked in this mode, and in particular cannot generally acquire locks safely.
+
+Similarly, GCX\_PREEMP potentially _releases_ a lock that had been held by the thread. Great care must be taken to ensure that all GC references are properly protected before entering preemptive mode.
+
+The [Rules of the Code](../coding-guidelines/clr-code-guide.md) document describes the disciplines needed to ensure safety around GC mode switches.
+
+Crst
+----
+
+Just as Monitor is the preferred locking mechanism for managed code, Crst is the preferred mechanism for VM code. Like Monitor, Crst is a hybrid lock that is aware of hosts and GC modes. Crst also implements deadlock avoidance via "lock leveling," described in the [Crst Leveling chapter of the BotR](../coding-guidelines/clr-code-guide.md#264-entering-and-leaving-crsts).
+
+It is generally illegal to acquire a Crst while in cooperative mode, though exceptions are made where absolutely necessary.
+
+Special Threads
+===============
+
+In addition to managing threads created by managed code, the CLR creates several "special" threads for its own use.
+
+Finalizer Thread
+----------------
+
+This thread is created in every process that runs managed code. When the GC determines that a finalizable object is no longer reachable, it places that object on a finalization queue. At the end of a GC, the finalizer thread is signaled to process all finalizers currently in this queue. Each object is then dequeued, one by one, and its finalizer is executed.
+
+This thread is also used to perform various CLR-internal housekeeping tasks, and to wait for notifications of some external events (such as a low-memory condition, which signals the GC to collect more aggressively). See GCHeap::FinalizerThreadStart for the details.
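+
+For example, a finalizer declared in C# runs on this dedicated thread, not on the thread that allocated the object:
+
+    class Resource
+    {
+        ~Resource()
+        {
+            // Invoked by the CLR's finalizer thread after the GC has
+            // determined that this instance is no longer reachable.
+        }
+    }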
+
+GC Threads
+----------
+
+When running in "concurrent" or "server" modes, the GC creates one or more background threads to perform various stages of garbage collection in parallel. These threads are wholly owned and managed by the GC, and never run managed code.
+
+Debugger Thread
+---------------
+
+The CLR maintains a single native thread in each managed process, which performs various tasks on behalf of attached managed debuggers.
+
+AppDomain-Unload Thread
+-----------------------
+
+This thread is responsible for unloading AppDomains. This is done on a separate, CLR-internal thread, rather than the thread that requests the AD-unload, to a) provide guaranteed stack space for the unload logic, and b) allow the thread that requested the unload to be unwound out of the AD, if needed.
+
+ThreadPool Threads
+------------------
+
+The CLR's ThreadPool maintains a collection of managed threads for executing user "work items." These managed threads are bound to native threads owned by the ThreadPool. The ThreadPool also maintains a small number of native threads to handle functions like "thread injection," timers, and "registered waits."
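+
+For example, a work item is queued to these managed threads via the public ThreadPool API:
+
+    using System.Threading;
+
+    class Program
+    {
+        static void Main()
+        {
+            ManualResetEvent done = new ManualResetEvent(false);
+            ThreadPool.QueueUserWorkItem(_ =>
+            {
+                // Runs on one of the pool's managed worker threads,
+                // which is bound to a native thread owned by the ThreadPool.
+                done.Set();
+            });
+            done.WaitOne();   // keep Main alive until the work item has run
+        }
+    }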
diff --git a/Documentation/botr/type-loader.md b/Documentation/botr/type-loader.md
new file mode 100644
index 0000000000..60a13cd22b
--- /dev/null
+++ b/Documentation/botr/type-loader.md
@@ -0,0 +1,317 @@
+Type Loader Design
+===
+
+Author: Ladi Prosek - 2007
+
+# Introduction
+
+In a class-based object-oriented system, types are templates
+describing the data that individual instances will contain, and the
+functionality that they will provide. It is not possible to create an
+object without first defining its type<sup>1</sup>. Two objects are said to
+be of the same type only if they are instances of the same type
+definition; the fact that two types define the exact same set of
+members does not make them related in any way.
+
+The previous paragraph could just as well describe a typical C++
+system. One additional feature essential to the CLR is the availability
+of full runtime type information. In order to "manage" the managed code
+and provide a type-safe environment, the runtime must know the type of
+any object at any time. Such type information must be readily available
+without extensive computation, because type identity queries are
+expected to be rather frequent (e.g. any type cast involves querying the
+type identity of the object to verify that the cast is safe).
+
+This performance requirement rules out any dictionary lookup
+approaches and leaves us with the following high-level architecture.
+
+![Figure 1](../images/typeloader-fig1.png)
+
+Figure 1 The abstract high-level object design
+
+Apart from the actual instance data, each object contains a type id
+which is simply a pointer to the structure that represents the
+type. This concept is similar to C++ v-table pointers, but the
+structure, which we will call TYPE for now and define more precisely
+later, contains more than just a v-table. For instance, it has to
+contain information about the hierarchy so that "is-a" subsumption
+questions can be answered.
+
+<sup>1</sup> The C# 3.0 feature called "anonymous types" lets you define an
+object without explicit reference to a type - simply by directly
+listing its fields. Don't let this fool you, there is in fact a type
+created behind the scenes for you by the compiler.
+
+## 1.1 Related Reading
+
+[1] Martin Abadi, Luca Cardelli, A Theory of Objects, ISBN
+978-0387947754
+
+[2] Andrew Kennedy ([@andrewjkennedy](https://github.com/andrewjkennedy)), Don Syme ([@dsyme](https://github.com/dsyme)), [Design and Implementation of Generics
+for the .NET Common Language
+Runtime][generics-design]
+
+[generics-design]: http://research.microsoft.com/apps/pubs/default.aspx?id=64031
+
+[3] [ECMA Standard for the Common Language Infrastructure (CLI)](http://www.ecma-international.org/publications/standards/Ecma-335.htm)
+
+## 1.2 Design Goals
+
+The ultimate purpose of the type loader (sometimes referred to as the
+class loader, which is strictly speaking not correct, because classes
+constitute just a subset of types - namely reference types - and the
+loader loads value types as well) is to build data structures
+representing the type which it is asked to load. These are the
+properties that the loader should have:
+
+- Fast type lookup ([module, token] => handle and [assembly, name] => handle).
+- Optimized memory layout to achieve good working set size, cache hit rate, and JITted code performance.
+- Type safety - malformed types are not loaded and a TypeLoadException is thrown.
+- Concurrency - scales well in multi-threaded environments.
+
+# 2 Type Loader Architecture
+
+There is a relatively small number of entry-points to the loader. Although the signature of each individual entry-point is slightly different, they all have similar semantics. They take a type/member designation in the form of a metadata **token** or a **name** string, a scope for the token (a **module** or an **assembly** ), and some additional information like flags. They return the loaded entity in the form of a **handle**.
+
+There are usually many calls to the type loader during JITting. Consider:
+
+ object CreateClass()
+ {
+ return new MyClass();
+ }
+
+In the IL, MyClass is referred to using a metadata token. In order to generate a call to the **JIT\_New** helper which takes care of the actual instantiation, the JIT will ask the type loader to load the type and return a handle to it. This handle will then be directly embedded in the JITted code as an immediate value. The fact that types and members are usually resolved and loaded at JIT time and not at run-time also explains the sometimes confusing behavior easily hit with code like this:
+
+ object CreateClass()
+ {
+ try {
+ return new MyClass();
+ } catch (TypeLoadException) {
+ return null;
+ }
+ }
+
+If **MyClass** fails to load, for example because it's supposed to be defined in another assembly and it was accidentally removed in the newest build, then this code will still throw **TypeLoadException**. The reason that the catch block did not catch it is that it never ran! The exception occurred during JITting and would only be catchable in the method that called **CreateClass** and caused it to be JITted. In addition, it may not always be obvious at which point the JITting is triggered due to inlining, so users should not expect and rely on deterministic behavior.
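+
+If catching the failure is genuinely required, the `new` can be moved into a separate non-inlined method so that its JITting, and therefore the type load, happens inside the protected region. This is a hedged sketch, not a guarantee the CLR makes (assumes `using System.Runtime.CompilerServices;`):
+
+    object CreateClassSafely()
+    {
+        try {
+            // JITting of CreateClassCore is deferred until this first call,
+            // so a TypeLoadException surfaces inside the try block.
+            return CreateClassCore();
+        } catch (TypeLoadException) {
+            return null;
+        }
+    }
+
+    [MethodImpl(MethodImplOptions.NoInlining)]
+    object CreateClassCore()
+    {
+        return new MyClass();
+    }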
+
+## Key Data Structures
+
+The most universal type designation in the CLR is the **TypeHandle**. It's an abstract entity which encapsulates a pointer to either a **MethodTable** (representing "ordinary" types like **System.Object** or **List<string>** ) or a **TypeDesc** (representing byrefs, pointers, function pointers, arrays, and generic variables). It constitutes the identity of a type in that two handles are equal if and only if they represent the same type. To save space, the fact that a **TypeHandle** contains a **TypeDesc** is indicated by setting the second lowest bit of the pointer to 1 (i.e. (ptr | 2)) instead of using additional flags<sup>2</sup>. **TypeDesc** is "abstract" and has the following inheritance hierarchy.
+
+![Figure 2](../images/typeloader-fig2.png)
+
+Figure 2 The TypeDesc hierarchy
+
+**TypeDesc**
+
+Abstract type descriptor. The concrete descriptor type is determined by flags.
+
+**TypeVarTypeDesc**
+
+Represents a type variable, i.e. the **T** in **List<T>** or in **Array.Sort<T>** (see the part about generics below). Type variables are never shared between multiple types or methods so each variable has its one and only owner.
+
+**FnPtrTypeDesc**
+
+Represents a function pointer, essentially a variable-length list of type handles referring to the return type and parameters. It's not that common to see this descriptor because function pointers are not supported by C#. However, managed C++ uses them.
+
+**ParamTypeDesc**
+
+This descriptor represents byref and pointer types. Byrefs are the results of the **ref** and **out** C# keywords applied to method parameters<sup>3</sup>, whereas pointer types are unmanaged pointers to data used in unsafe C# and managed C++.
+
+**ArrayTypeDesc**
+
+Represents array types. It is derived from **ParamTypeDesc** because arrays are also parameterized by a single parameter (the type of their element). This is opposed to generic instantiations whose number of parameters is variable.
+
+**MethodTable**
+
+This is by far the central data structure of the runtime. It represents any type which does not fall into one of the categories above (this includes primitive types, and generic types, both "open" and "closed"). It contains everything about the type that needs to be looked up quickly, such as its parent type, implemented interfaces, and the v-table.
+
+**EEClass**
+
+**MethodTable** data are split into "hot" and "cold" structures to improve working set and cache utilization. **MethodTable** itself is meant to only store "hot" data that are needed in program steady state. **EEClass** stores "cold" data that are typically only needed by type loading, JITting, or reflection. Each **MethodTable** points to one **EEClass**.
+
+Moreover, **EEClass**es are shared between generic types. Multiple generic type **MethodTable**s can point to a single **EEClass**. This sharing adds additional constraints on the data that can be stored on **EEClass**.
+
+**MethodDesc**
+
+It is no surprise that this structure describes a method. It actually comes in a few flavors which have their corresponding **MethodDesc** subtypes but most of them really are out of the scope of this document. Suffice it to say that there is one subtype called **InstantiatedMethodDesc** which plays an important role for generics. For more information please see [**Method Descriptor Design**](method-descriptor.md).
+
+**FieldDesc**
+
+Analogous to **MethodDesc** , this structure describes a field. Except for certain COM interop scenarios, the EE does not care about properties and events at all because they boil down to methods and fields at the end of the day, and it's just compilers and reflection that generate and understand them in order to provide that syntactic-sugar kind of experience.
+
+<sup>2</sup> This is useful for debugging. If the value of a **TypeHandle**
+ends with 2, 6, A, or E, then it's not a **MethodTable** and the extra
+bit has to be cleared in order to successfully inspect the
+**TypeDesc**.
+
+<sup>3</sup> Note that the difference between **ref** and **out** is just in a
+parameter attribute. As far as the type system is concerned, they are
+both the same type.
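+
+One observable consequence in C# is that methods cannot be overloaded solely on this attribute:
+
+    class C
+    {
+        void M(ref int x) { }
+        // void M(out int x) { }  // compile-time error: overloads cannot
+                                  // differ only on ref versus out
+    }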
+
+## 2.1 Load Levels
+
+When the type loader is asked to load a specified type, identified for example by a typedef/typeref/typespec **token** and a **Module** , it does not do all the work atomically at once. The loading is done in phases instead. The reason for this is that the type usually depends on other types and requiring it to be fully loaded before it can be referred to by other types would result in infinite recursion and deadlocks. Consider:
+
+    class A<T> : C<B<T>>
+    { }
+
+    class B<T> : C<A<T>>
+    { }
+
+    class C<T>
+    { }
+
+These are valid types and apparently **A** depends on **B** and **B** depends on **A**.
+
+The loader initially creates the structure(s) representing the type and initializes them with data that can be obtained without loading other types. When this "no-dependencies" work is done, the structure(s) can be referred to from other places, usually by sticking pointers to them into other structures. After that, the loader progresses in incremental steps and fills the structure(s) with more and more information until it finally arrives at a fully loaded type. In the above example, the base types of **A** and **B** will be approximated by something that does not include the other type, and substituted by the real thing later.
+
+The exact half-loaded state is described by the so-called load level, starting with CLASS\_LOAD\_BEGIN, ending with CLASS\_LOADED, and having a couple of intermediate levels in between. There are rich and useful comments about individual load levels in the [classloadlevel.h](https://github.com/dotnet/coreclr/blob/master/src/vm/classloadlevel.h) source file. Notice that although types can be saved in NGEN images, the representing structures cannot be simply mapped or blitted into memory and used without additional work called "restoring". The fact that a type came from an NGEN image and needs to be restored is also captured by its load level.
+
+See [Design and Implementation of Generics
+for the .NET Common Language
+Runtime][generics-design] for more detailed explanation of load levels.
+
+## 2.2 Generics
+
+In the generics-free world, everything is nice and everyone is happy because every ordinary (not represented by a **TypeDesc**) type has one **MethodTable** pointing to its associated **EEClass** which in turn points back to the **MethodTable**. All instances of the type contain a pointer to the **MethodTable** as their first field at offset 0, i.e. at the address seen as the reference value. To conserve space, **MethodDescs** representing methods declared by the type are organized in a linked list of chunks pointed to by the **EEClass**<sup>4</sup>.
+
+![Figure 3](../images/typeloader-fig3.png)
+
+Figure 3 Non-generic type with non-generic methods
+
+<sup>4</sup> Of course, when managed code runs, it does not call methods by
+looking them up in the chunks. Calling a method is a very "hot"
+operation and normally needs to access only information in the
+**MethodTable**.
+
+### 2.2.1 Terminology
+
+**Generic Parameter**
+
+A placeholder to be substituted by another type; the **T** in the declaration of **List<T>**. Sometimes called formal type parameter. A generic parameter has a name and optional generic constraints.
+
+**Generic Argument**
+
+A type being substituted for a generic parameter; the **int** in **List<int>**. Note that a generic parameter can also be used as an argument. Consider:
+
+ List<T> GetList<T>()
+ {
+ return new List<T>();
+ }
+
+The method has one generic parameter **T** which is used as a generic argument for the generic list class.
+
+**Generic Constraint**
+
+An optional requirement placed by a generic parameter on its potential generic arguments. Types that do not have the required properties may not be substituted for the generic parameter; this is enforced by the type loader. There are three kinds of generic constraints:
+
+1. Special constraints
+ - Reference type constraint - the generic argument must be a reference type (as opposed to a value type). The `class` keyword is used in C# to express this constraint.
+
+ public class A<T> where T : class
+
+ - Value type constraint - the generic argument must be a value type different from `System.Nullable<T>`. C# uses the `struct` keyword.
+
+ public class A<T> where T : struct
+
+ - Default constructor constraint - the generic argument must have a public parameterless constructor. This is expressed by `new()` in C#.
+
+ public class A<T> where T : new()
+
+2. Base type constraints - the generic argument must be derived from
+(or directly be of) the given non-interface type. It obviously makes
+sense to use only zero or one reference type as a base type
+constraint.
+
+ public class A<T> where T : EventArgs
+
+3. Implemented interface constraints - the generic argument must
+implement (or directly be of) the given interface type. Zero or more
+interfaces can be given.
+
+ public class A<T> where T : ICloneable, IComparable<T>
+
+The above constraints are combined with an implicit AND, i.e. a
+generic parameter can be constrained to be derived from a given type,
+implement several interfaces, and have the default constructor. All
+generic parameters of the declaring type can be used to express the
+constraints, introducing interdependencies among the parameters. For
+example:
+
+ public class A<S, T, U>
+ where S : T
+ where T : IList<U> {
+ void f<V>(V v) where V : S {}
+ }
+
+**Instantiation**
+
+The list of generic arguments that were substituted for generic
+parameters of a generic type or method. Each loaded generic type and
+method has its instantiation.
+
+**Typical Instantiation**
+
+An instantiation consisting purely of the type's or method's own type
+parameters and in the same order in which the parameters are
+declared. There exists exactly one typical instantiation for each
+generic type and method. Usually when one talks about an open generic
+type, they have the typical instantiation in mind. Example:
+
+ public class A<S, T, U> {}
+
+The C# expression `typeof(A<,,>)` compiles to ldtoken A\`3, which makes the
+runtime load **A`3** instantiated at **S** , **T** , **U**.
+
+**Canonical Instantiation**
+
+An instantiation where all generic arguments are
+**System.\_\_Canon**. **System.\_\_Canon** is an internal type defined
+in **mscorlib** and its task is just to be well-known and different
+from any other type which may be used as a generic
+argument. Types/methods with canonical instantiation are used as
+representatives of all instantiations and carry information shared by
+all instantiations. Since **System.\_\_Canon** can obviously not
+satisfy any constraints that the respective generic parameter may have
+on it, constraint checking is special-cased with respect to
+**System.\_\_Canon** and ignores these violations.
+
+### 2.2.2 Sharing
+
+With the advent of generics, the number of types loaded by the runtime
+tends to be higher. Although generic types with different
+instantiations (for example **List&lt;string>** and **List&lt;object>**)
+are different types each with its own **MethodTable** , it turns out
+that there is a considerable amount of information that they can
+share. This sharing has a positive impact on the memory footprint and
+consequently also performance.
+
+![Figure 4](../images/typeloader-fig4.png)
+
+Figure 4 Generic type with non-generic methods - shared EEClass
+
+Currently all instantiations containing reference types share the same
+**EEClass** and its **MethodDescs**. This is feasible because all
+references are of the same size - 4 or 8 bytes - and hence the layout
+of all these types is the same. The figure illustrates this for
+**List&lt;object>** and **List&lt;string>**. The canonical **MethodTable**
+was created automatically before the first reference type
+instantiation was loaded and contains data which is hot but not
+instantiation specific like non-virtual slots or
+**RemotableMethodInfo**. Instantiations containing only value types
+are not shared and every such instantiated type gets its own unshared
+**EEClass**.
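+
+This sharing is invisible from managed code; the instantiations remain distinct types. A small C# illustration (assumes `using System;` and `using System.Collections.Generic;`):
+
+    // Distinct types, hence distinct MethodTables...
+    Console.WriteLine(typeof(List<object>) == typeof(List<string>));  // False
+    // ...even though, behind the scenes, both reference-type
+    // instantiations share one EEClass and its MethodDescs.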
+
+**MethodTables** representing generic types loaded so far are cached
+in a hash table owned by their loader module<sup>5</sup>. This hash table is
+consulted before a new instantiation is constructed, making sure
+that there will never be two or more **MethodTable** instances
+representing the same type.
+
+See [Design and Implementation of Generics
+for the .NET Common Language
+Runtime][generics-design] for more information about generic sharing.
+
+<sup>5</sup> Things get a bit more complicated for types loaded from NGEN
+images.
diff --git a/Documentation/botr/type-system.md b/Documentation/botr/type-system.md
new file mode 100644
index 0000000000..ca5f23463a
--- /dev/null
+++ b/Documentation/botr/type-system.md
@@ -0,0 +1,233 @@
+Type System Overview
+====================
+
+Author: David Wrighton ([@davidwrighton](https://github.com/davidwrighton)) - 2010
+
+Introduction
+============
+
+The CLR type system is our representation of the type system described in the ECMA specification, plus extensions.
+
+Overview
+--------
+
+The type system is composed of a series of data structures, some of which are described in other Book of the Runtime chapters, as well as a set of algorithms which operate on and create those data structures. It is NOT the type system exposed through reflection, although that one does depend on this system.
+
+The major data structures maintained by the type system are:
+
+- MethodTable
+- EEClass
+- MethodDesc
+- FieldDesc
+- TypeDesc
+- ClassLoader
+
+The major algorithms contained within the type system are:
+
+- **Type Loader:** Used to load types and create most of the primary data structures of the type system.
+- **CanCastTo and similar:** The functionality of comparing types.
+- **LoadTypeHandle:** Primarily used for finding types.
+- **Signature parsing:** Used to compare and gather information about methods and fields.
+- **GetMethod/FieldDesc:** Used to find/load methods/fields.
+- **Virtual Stub Dispatch:** Used to find the destination of virtual calls to interfaces.
+
+There are significantly more ancillary data structures and algorithms that provide various bits of information to the rest of the CLR, but they are less significant to the overall understanding of the system.
+
+Component Architecture
+----------------------
+
+The type system's data structures are generally used by all of the various algorithms. This document does not describe the type system algorithms (as there are, or should be, other Book of the Runtime documents for those), but it does attempt to describe the various major data structures below.
+
+Dependencies
+------------
+
+The type system is generally a service provided to many parts of the CLR, and most core components have some form of dependency on the behavior of the type system. This diagram describes the general dataflow that affects the type system. It is not exhaustive, but calls out the major information flows.
+
+![dependencies](../images/type-system-dependencies.png)
+
+### Component Dependencies
+
+The primary dependencies of the type system follow:
+
+- The **loader** is needed to get the correct metadata to work with.
+- The **metadata system** provides a metadata API to gather information.
+- The **security system** informs the type system whether or not certain type system structures are permitted (e.g. inheritance).
+- The **AppDomain** provides a LoaderAllocator to handle allocation behavior for the type system data structures.
+
+### Components Dependent on this Component
+
+Three primary components depend on the type system.
+
+- The **JIT interface** and the JIT helpers primarily depend on the type, method, and field searching functionality. Once the type system object is found, the data structures returned have been tailored to provide the information needed by the JIT.
+- **Reflection** uses the type system to provide relatively simple access to ECMA standardized concepts which we happen to capture in the CLR type system data structures.
+- **General managed code execution** requires the use of the type system for type comparison logic, and virtual stub dispatch.
+
+Design of Type System
+=====================
+
+The core type system data structures are the data structures that represent the actual loaded types (e.g. TypeHandle, MethodTable, MethodDesc, TypeDesc, EEClass) and the data structures that allow types to be found once they are loaded (e.g. ClassLoader, Assembly, Module, RIDMaps).
+
+The data structures and algorithms for loading types are discussed in the [Type Loader](type-loader.md) and [MethodDesc](method-descriptor.md) Book of the Runtime chapters.
+
+Tying those data structures together is a set of functionality that allows the JIT/Reflection/TypeLoader/stackwalker to find existing types and methods. The general idea is that these searches should be easily driven by the metadata tokens/signatures that are specified in the ECMA CLI specification.
+
+And finally, when the appropriate type system data structure is found, we have algorithms to gather information from a type, and/or compare two types. A particularly complicated example of this form of algorithm may be found in the [Virtual Stub Dispatch](virtual-stub-dispatch.md) Book of the Runtime chapter.
+
+Design Goals and Non-goals
+--------------------------
+
+### Goals
+
+- Accessing information needed at runtime from executing (non-reflection) code is very fast.
+- Accessing information needed at compilation time for generating code is straightforward.
+- The garbage collector/stackwalker is able to access necessary information without taking locks, or allocating memory.
+- A minimal number of types is loaded at a time.
+- A minimal amount of information about a given type is loaded at type load time.
+- Type system data structures must be storable in NGEN images.
+
+### Non-Goals
+
+- All information in the metadata is directly reflected in the CLR data structures.
+- All uses of reflection are fast.
+
+Design of a typical algorithm used at runtime during execution of managed code
+------------------------------------------------------------------------------
+
+The casting algorithm is typical of algorithms in the type system that are heavily used during the execution of managed code.
+
+There are at least 4 separate entrypoints into this algorithm. Each entrypoint is chosen to provide a different fast path, in the hopes that the best performance possible will be achieved.
+
+- Can an object be cast to a particular non-type equivalent non-array type?
+- Can an object be cast to an interface type that does not implement generic variance?
+- Can an object be cast to an array type?
+- Can an object of a type be cast to an arbitrary other managed type?
+
+Each of these implementations, with the exception of the last one, is optimized to perform better at the expense of not being fully general.
+
+For instance, the "Can a type be cast to a parent type" which is a variant of "Can an object be cast to a particular non-type equivalent non-array type?" code is implemented with a single loop that walks a singly linked list. This is only able to search a subset of possible casting operations, but it is possible to determine if that is the appropriate set by examining the type the cast is trying to enforce. This algorithm is implemented in the jit helper JIT\_ChkCastClass\_Portable.
+
+Assumptions:
+
+- Special purpose implementations of algorithms are a performance improvement in general.
+- Extra versions of algorithms do not provide an insurmountable maintenance problem.
+
+Design of typical search algorithm in the Type System
+-----------------------------------------------------
+
+There are a number of algorithms in the type system which follow this common pattern.
+
+The type system is commonly used to find a type. This may be triggered via any number of inputs such as the JIT, reflection, serialization, remoting, etc.
+
+The basic input to the type system in these cases is:
+
+- The context from which the search shall begin (a Module or assembly pointer).
+- An identifier that describes the sought after type in the initial context. This is typically a token, or a string (if an assembly is the search context).
+
+The algorithm must first decode the identifier.
+
+In the type-search scenario, the token may be either a TypeDef token, a TypeRef token, a TypeSpec token, or a string. Each of these different identifiers will cause a different form of lookup.
+
+- A **typedef token** will cause a lookup in the RidMap of the Module. This is a simple array index (see the sketch after this list).
+- A **typeref token** will cause a lookup to find the assembly which this typeref token refers to, and then the type finding algorithm is begun anew with the found assembly pointer, and a string gathered from the typeref table.
+- A **typespec token** indicates that the type is described by a signature. The signature is parsed to find the information necessary to load the type. This will recursively trigger more type finding.
+- A **name** is used to bind between assemblies. The TypeDef/ExportedTypes table is searched for matches. Note: This search is optimized by hashtables on the manifest module object.
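+
+As a sketch of the typedef case above: conceptually, the lookup is just an array index once the row id is extracted from the token (per ECMA-335, a metadata token carries the table kind in its top byte and the row id in its low 24 bits). The types and member names below are illustrative, not the actual runtime API.
+
+    struct TypeHandle { }
+
+    class Module
+    {
+        public TypeHandle[] TypeDefRidMap;    // hypothetical RidMap array
+    }
+
+    static TypeHandle LookupTypeDef(Module module, int typeDefToken)
+    {
+        int rid = typeDefToken & 0x00FFFFFF;  // row id: low 24 bits of the token
+        return module.TypeDefRidMap[rid];     // the lookup is a simple array index
+    }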
+
+From this design a number of common characteristics of search algorithms in the type system are evident.
+
+- Searches use input that is tightly coupled to metadata. In particular, metadata tokens and string names are commonly passed around. Also, these searches are tied to Modules, which directly map to .dll and .exe files.
+- Use of cached information to improve performance. The RidMap and hash tables are data structures optimized to improve these lookups.
+- The algorithms typically have 3-4 different paths based on their input.
+
+In addition to this general design, there are a number of extra requirements that are layered onto this.
+
+- **ASSUMPTION:** Searching for types that are already loaded is safe to perform while stopped in the GC.
+- **INVARIANT:** A type which has already been loaded will always be found if searched for.
+- **ISSUE:** Search routines rely on metadata reading. This can yield inadequate performance in some scenarios.
+
+This search algorithm is typical of the routines used during JITing. It has a number of common characteristics.
+
+- It uses metadata.
+- It requires looking for data in many places.
+- There is relatively little duplication of data in our data structures.
+- It typically does not recurse deeply, and does not have loops.
+
+This allows us to meet the performance requirements, and characteristics necessary for working with an IL based JIT.
+
+Garbage Collector Requirements on the Type System
+-------------------------------------------------
+
+The garbage collector requires information about instances of types allocated in the GC heap. This is done via a pointer to a type system data structure (MethodTable) at the head of every managed object. Attached to the MethodTable is a data structure that describes the GC layout of instances of types. There are two forms of this layout (one for normal types and object arrays, and another for arrays of valuetypes).
+
+- **ASSUMPTION:** Type system data structures have a lifetime that exceeds that of managed objects that are of types described in the type system data structure.
+- **REQUIREMENT:** The garbage collector has a requirement to execute the stack walker while the runtime is suspended. This will be discussed next.
+
+Stackwalker requirements on the Type System
+-------------------------------------------
+
+The stack walker (including the GC stack walker) requires type system input in two cases:
+
+- For finding the size of valuetypes on the stack.
+- For finding GC roots to report within valuetypes on the stack.
+
+For various reasons involving the desire to delay type loading and to avoid generating multiple versions of code (that only differ via associated GC info), the CLR currently requires the walking of signatures of methods that are on the stack. This need is rarely exercised, as it requires the stack walker to execute at very particular moments in time, but in order to meet our reliability goals, the signature walker must be able to function while stackwalking.
+
+The stack walker executes in approximately three modes:
+
+- To walk the stack of the current thread for security or exception processing reasons.
+- To walk the stack of all threads for GC purposes (all threads are suspended by the EE).
+- To walk the stack of a particular thread for a profiler (that specific thread is suspended).
+
+In the GC stack walking case, and in the profiler stack walking case, due to thread suspension, it is not safe to allocate memory or take most locks.
+
+This has led us to develop a path through the type system which may be relied upon to follow the above requirement.
+
+The rules required for the type system to achieve this goal are:
+
+- If a method has been called, then all valuetype parameters of the called method will have been loaded into some appdomain in the process.
+- The assembly reference from the assembly with the signature to the assembly implementing the type must be resolved before a walk of the signature is necessary as part of a stack walk.
+
+This is enforced via an extensive and complicated set of enforcements within the type loader, NGEN image generation process, and JIT.
+
+- **ISSUE:** Stackwalker requirements on the type system are HIGHLY fragile.
+- **ISSUE:** Implementation of stack walker requirements in the type system requires a set of contract violations at every function in the type system that may be touched while searching for types which are loaded.
+- **ISSUE:** The signature walks performed are done with the normal signature walking code. This code is designed to load types as it walks the signature, but in this case the type load functionality is used with the assumption that no type load will actually be triggered.
+- **ISSUE:** Stackwalker requirements require support from not just the type system, but also the assembly loader. The Loader has had a number of issues meeting the needs of the type system here.
+
+Type System and NGEN
+--------------------
+
+The type system data structures are a core part of what is saved into NGEN images. Unfortunately, these data structures logically have pointers within them that point to other NGEN images. In order to handle this situation, the type system data structures implement a concept known as restoration.
+
+In restoration, when a type system data structure is first needed, the data structure is fixed up with correct pointers. This is tied into the type loading levels described in the [Type Loader](type-loader.md) Book of the Runtime chapter.
+
+There also exists the concept of pre-restored data structures. This means that the data structure is sufficiently correct at NGEN image load time (after intra-module pointer fixups and eager load type fixups) that the data structure may be used as is. This optimization requires that the NGEN image be "hard bound" to its dependent assemblies. See NGEN documentation for further details.
+
+Type System and Domain Neutral Loading
+--------------------------------------
+
+The type system is a core part of the implementation of domain neutral loading. This is exposed to customers through the LoaderOptimization options available at AppDomain creation. Mscorlib is always loaded as domain neutral. The core requirement of this feature is that the type system data structures must not require pointers to domain specific state. Primarily this manifests itself in requirements around static fields and class constructors. In particular, whether or not a class constructor has been run is not a part of the core MethodTable data structure for this reason, and there is a mechanism for storing static data attached to the DomainFile data structure instead of the MethodTable data structure.
+
+Physical Architecture
+=====================
+
+Major parts of the type system are found in:
+
+- Class.cpp/inl/h – EEClass functions, and BuildMethodTable
+- MethodTable.cpp/inl/h – Functions for manipulating methodtables.
+- TypeDesc.cpp/inl/h – Functions for examining TypeDesc
+- MetaSig.cpp SigParser – Signature code
+- FieldDesc /MethodDesc – Functions for examining these data structures
+- Generics – Generics specific logic.
+- Array – Code for handling the special cases required for array processing
+- VirtualStubDispatch.cpp/h/inl – Code for virtual stub dispatch
+- VirtualCallStubCpu.hpp – Processor specific code for virtual stub dispatch.
+
+Major entry points are BuildMethodTable, LoadTypeHandleThrowing, CanCastTo\*, GetMethodDescFromMemberDefOrRefOrSpecThrowing, GetFieldDescFromMemberRefThrowing, CompareSigs, and VirtualCallStubManager::ResolveWorkerStatic.
+
+Related Reading
+===============
+
+- [ECMA CLI Specification](../project-docs/dotnet-standards.md)
+- [Type Loader](type-loader.md) Book of the Runtime Chapter
+- [Virtual Stub Dispatch](virtual-stub-dispatch.md) Book of the Runtime Chapter
+- [MethodDesc](method-descriptor.md) Book of the Runtime Chapter
diff --git a/Documentation/botr/virtual-stub-dispatch.md b/Documentation/botr/virtual-stub-dispatch.md
new file mode 100644
index 0000000000..8d5a52c47a
--- /dev/null
+++ b/Documentation/botr/virtual-stub-dispatch.md
@@ -0,0 +1,188 @@
+Virtual Stub Dispatch
+=====================
+
+Author: Simon Hall ([@snwbrdwndsrf](https://github.com/snwbrdwndsrf)) - 2006
+
+Introduction
+============
+
+Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.
+
+Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.
+
+Dependencies
+------------
+
+### Component Dependencies
+
+The stub dispatching code exists relatively independently of the rest of the runtime. It provides an API that allows dependent components to use it, and the dependencies listed below comprise a relatively small surface area.
+
+#### Code Manager
+
+VSD effectively relies on the code manager to provide information about the state of a method, in particular, whether or not a particular method has transitioned to its final state, so that VSD may decide on details such as stub generation and target caching.
+
+#### Types and Methods
+
+MethodTables hold pointers to the dispatch maps used to determine the target code address for any given VSD call site.
+
+#### Special Types
+
+Calls on COM interop types must be custom dispatched, as they have specialized target resolution.
+
+### Components Dependent on this Component
+
+#### Code Manager
+
+The code manager relies on VSD for providing the JIT compiler with call site targets for interface calls.
+
+#### Class Builder
+
+The class builder uses the API exposed by the dispatch mapping code to create dispatch maps during type building that will be used at dispatch time by the VSD code.
+
+Design Goals and Non-goals
+--------------------------
+
+### Goals
+
+#### Working Set Reduction
+
+Interface dispatch was previously implemented using a large, somewhat sparse vtable lookup map dealing with process-wide interface identifiers. The goal was to reduce the amount of cold working set by generating dispatch stubs as they were required, in theory keeping related call sites and their dispatch stubs close to each other and increasing the working set density.
+
+It is important to note that the initial working set involved with VSD is higher per call site due to the data structures required to track the various stubs that are created and collected as the system runs; however, as an application reaches steady state, these data structures are not needed for simple dispatching and so get paged out. Unfortunately, for client applications this equated to a slower startup time, which is one of the factors that led to disabling VSD for virtual methods.
+
+#### Throughput Parity
+
+It was important to keep interface and virtual method dispatch at an amortized parity with the previous vtable dispatch mechanism.
+
+While it was immediately obvious that this was achievable with interface dispatch, it turned out to be somewhat slower with virtual method dispatch, one of the factors that led to disabling VSD for virtual methods.
+
+Design of Token Representation and Dispatch Map
+-----------------------------------------------
+
+Dispatch tokens are native word-sized values that are allocated at runtime, consisting internally of a tuple that represents an interface and slot.
+
+The design uses a combination of assigned type identifier values and slot numbers. Dispatch tokens consist of a combination of these two values. To facilitate integration with the runtime, the implementation also assigns slot numbers in the same way as the classic v-table layout. This means that the runtime can still deal with MethodTables, MethodDescs, and slot numbers in exactly the same way, except that the v-table must be accessed via helper methods instead of being directly accessed in order to handle this abstraction.
+
+The term _slot_ will always be used in the context of a slot index value in the classic v-table layout world and as created and interpreted by the mapping mechanism. What this means is that this is the slot number if you were to picture the classic method table layout of virtual method slots followed by non-virtual method slots, as previously implemented in the runtime. It's important to understand this distinction because within the runtime code, slot means both an index into the classic v-table structure and the address of the pointer in the v-table itself. The change is that slot is now only an index value, and the code pointer addresses are contained in the implementation table (discussed below).
+
+The dynamically assigned type identifier values will be discussed later on.
+
+### Method Table
+
+#### Implementation Table
+
+This is an array that, for each method body introduced by the type, has a pointer to the entrypoint to that method. Its members are arranged in the following order:
+
+- Introduced (newslot) virtual methods.
+- Introduced non-virtual (instance and static) methods.
+- Overriding virtual methods.
+
+The reason for this format is that it provides a natural extension to the classic v-table layout. As a result many entries in the slot map (described below) can be inferred by this order and other details such as the total number of virtuals and non-virtuals for the class.
+
+When stub dispatch for virtual instance methods is disabled (as it is currently), the implementation table is non-existent and is substituted with a true vtable. All mapping results are expressed as slots for the vtable rather than an implementation table. Keep this in mind when implementation tables are mentioned throughout this document.
+
+#### Slot Map
+
+The slot map is a table of zero or more <_type_, [<_slot_, _scope_, (_index | slot_)>]> entries. _type_ is the dynamically assigned identification number mentioned above, and is either a sentinel value to indicate the current class (a call to a virtual instance method), or is an identifier for an interface implemented by the current class (or implicitly by one of its parents). The sub-map (contained in brackets) has one or more entries. Within each entry, the first element always indicates a slot within _type_. The second element, _scope_, specifies whether or not the third element is an implementation _index_ or a _slot_ number. _scope_ can be a known sentinel value that indicates that the next number is to be interpreted as a virtual slot number, and should be resolved virtually as _this.slot_. _scope_ can also identify a particular class in the inheritance hierarchy of the current class, and in such a case the third argument is an _index_ into the implementation table of the class indicated by _scope_, and is the final method implementation for _type.slot_.
+
+#### Example
+
+The following is a small class structure (modeled in C#), and what the resulting implementation table and slot map would be for each class.
+
+![Figure 1](../images/virtualstubdispatch-fig1.png)
+
+Thus, looking at this map, we see that the first column of the sub-maps of the slot maps correspond to the slot number in the classic virtual table view (remember that System.Object contributes four virtual methods of its own, which are omitted for clarity). Searches for method implementations are always bottom-up. Thus, if I had an object of type _B_ and I wished to invoke _I.Foo_, I would look for a mapping of _I.Foo_ starting at _B_'s slot map. Not finding it there, I would look in _A_'s slot map and find it there. It states that virtual slot 0 of _I_ (corresponding to _I.Foo_) is implemented by virtual slot 0. Then I return to _B_'s slot map and search for an implementation for slot 0, and find that it is implemented by slot 1 in its own implementation table.
+
+### Additional Uses
+
+It is important to note that this mapping technique can be used to implement methodimpl re-mapping of virtual slots (i.e., a virtual slot mapping in the map for the current class, similar to how an interface slot is mapped to a virtual slot). Because of the scoping capabilities of the map, non-virtual methods may also be referenced. This may be useful if ever the runtime wants to support the implementation of interfaces with non-virtual methods.
+
+### Optimizations
+
+The slot maps are bit-encoded and take advantage of typical interface implementation patterns using delta values, thus reducing the map size significantly. In addition, new slots (both virtual and non-) can be implied by their order in the implementation table. If the table contains new virtual slots followed by new instance slots, then followed by overrides, then the appropriate slot map entries can be implied by their index in the implementation table combined with the number of virtuals inherited by the parent class. All such implied map entries have been indicated with a (\*). The current layout of data structures uses the following pattern, where the DispatchMap is only present when mappings cannot be fully implied by ordering in the implementation table.
+
+ MethodTable -> [DispatchMap ->] ImplementationTable
+
+Type ID Map
+-----------
+
+This will map types to IDs, which are allocated as monotonically increasing values as each previously unmapped type is encountered. Currently, all such types are interfaces.
+
+Currently, this is implemented using a HashMap, and contains entries for both lookup directions.
+
+Dispatch Tokens
+---------------
+
+Dispatch tokens will be <_typeID_,_slot_> tuples. For interfaces, the type will be the interface ID assigned to that type. For virtual methods, this will be a constant value to indicate that the slot should just be resolved virtually within the type to be dispatched on (a virtual method call on _this_). This value pair will in most cases fit into the platform's native word size. On x86, this will likely be the lower 16 bits of each value, concatenated. This can be generalized to handle overflow issues similar to how a _TypeHandle_ in the runtime can be either a _MethodTable_ pointer or a <_TypeHandle,TypeHandle_> pair, using a sentinel bit to differentiate the two cases. It has yet to be determined if this is necessary.
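+
+A hedged sketch of the x86-style packing described above; the exact layout is an implementation detail:
+
+    static uint MakeDispatchToken(ushort typeId, ushort slot)
+    {
+        // Concatenate the lower 16 bits of each value into one 32-bit word.
+        return ((uint)typeId << 16) | slot;
+    }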
+
+Design of Virtual Stub Dispatch
+===============================
+
+Dispatch Token to Implementation Resolution
+-------------------------------------------
+
+Given a token and type, the implementation is found by mapping the token to an implementation table index for the type. The implementation table is reachable from the type's MethodTable. This map is created in BuildMethodTable: it enumerates all interfaces implemented by the type for which it is building a MethodTable and determines every interface method that the type implements or overrides. By keeping track of this information, at interface dispatch time it is possible to determine the target code given the token and the target object (from which the MethodTable and token mapping can be obtained).
+
+Stubs
+-----
+
+Interface dispatch calls go through stubs. These stubs are all generated on demand, and each ultimately serves the same purpose: match a token and an object with an implementation, and forward the call to that implementation.
+
+There are currently three types of stubs. The diagram below shows the general control flow between them; each type is explained in the sections that follow.
+
+![Figure 2](../images/virtualstubdispatch-fig2.png)
+
+### Generic Resolver
+
+This is in fact just a C function that serves as the final failure path for all stubs. It takes a <_token_, _type_> tuple and returns the target. The generic resolver is also responsible for creating dispatch and resolve stubs when they are required, patching indirection cells when better stubs become available, caching results, and all bookkeeping.
+
+### Lookup Stubs
+
+These stubs are the first to be assigned to an interface dispatch call site, and are created when the JIT compiles an interface call site. Since the JIT has no knowledge of the type that will be used to satisfy a token until the first call is made, this stub passes the token and type as arguments to the generic resolver. If necessary, the generic resolver will also create dispatch and resolve stubs, and will then backpatch the call site to point to the dispatch stub so that the lookup stub is no longer used.
+
+One lookup stub is created for each unique token (i.e., call sites for the same interface slot will use the same lookup stub).
+
+### Dispatch Stubs
+
+These stubs are used when a call site is believed to be monomorphic in behaviour, meaning that the objects used at a particular call site are typically of the same type (i.e., most of the time the object being invoked is the same type as the last object invoked at that site). A dispatch stub takes the type (MethodTable) of the object being invoked, compares it with its cached type, and upon success jumps to its cached target. On x86, this typically results in a "comparison, conditional failure jump, jump to target" sequence and provides the best performance of any stub. If a stub's type comparison fails, it jumps to its corresponding resolve stub (see below).
+
+One dispatch stub is created for each unique <_token_, _type_> tuple, but only lazily, when a call site's lookup stub is invoked.
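+
+In C++-flavored pseudocode (a sketch only: a real dispatch stub is a few instructions of generated machine code, and the constants shown here as parameters are baked directly into each stub):
+
+    struct MethodTable;
+    struct Object { const MethodTable* methodTable; };
+    typedef void* CodePointer;
+
+    // Logic of one dispatch stub, generated per <token, type> pair.
+    CodePointer DispatchStubLogic(const Object* obj,
+                                  const MethodTable* expectedMT, // cached type
+                                  CodePointer target,            // cached code
+                                  CodePointer resolveStub)       // failure path
+    {
+        // "comparison, conditional failure jump, jump to target" on x86:
+        if (obj->methodTable == expectedMT)
+            return target;      // monomorphic hit: jump straight to the method
+        return resolveStub;     // miss: fall through to the resolve stub
+    }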
+
+### Resolve Stubs
+
+Polymorphic call sites are handled by resolve stubs. These stubs use the key pair <_token_, _type_> to resolve the target in a global cache, where _token_ is known at JIT time and _type_ is determined at call time. If the global cache does not contain a match, then the final step of the resolve stub is to call the generic resolver and jump to the returned target. Since the generic resolver will insert the <_token_, _type_, _target_> tuple into the cache, a subsequent call with the same <_token_, _type_> tuple will successfully find the target in the cache.
+
+When a dispatch stub fails frequently enough, the call site is deemed polymorphic, and the resolve stub will backpatch the call site to point directly at the resolve stub, avoiding the overhead of a consistently failing dispatch stub. At sync points (currently the end of a GC), polymorphic sites are randomly promoted back to monomorphic call sites, under the assumption that the polymorphic behaviour of a call site is usually temporary. If this assumption is incorrect for a particular call site, it will quickly trigger a backpatch that demotes it to polymorphic again.
+
+One resolve stub is created per token, but all resolve stubs share a single global cache. A stub per token allows a fast, effective hashing algorithm, using a hash pre-calculated from the token, the unchanging component of the <_token_, _type_> tuple.
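+
+A sketch of the resolve stub's cache probe, under the stated assumption that part of the hash is pre-calculated per stub from the token (the cache layout and hash mixing below are illustrative, not the runtime's actual scheme):
+
+    #include <cstddef>
+    #include <cstdint>
+
+    struct MethodTable;
+    typedef void* CodePointer;
+
+    // One entry of the (illustrative) shared global cache.
+    struct CacheEntry {
+        uintptr_t          token;
+        const MethodTable* type;
+        CodePointer        target;
+    };
+
+    const size_t CACHE_SIZE = 4096;          // illustrative; a power of two
+    extern CacheEntry g_resolveCache[CACHE_SIZE];
+
+    // Slow path: computes the target, creates/backpatches stubs, and
+    // inserts the <token, type, target> tuple into the cache.
+    CodePointer GenericResolver(uintptr_t token, const MethodTable* type);
+
+    CodePointer ResolveStubLogic(uintptr_t token,         // fixed at JIT time
+                                 size_t tokenHash,        // pre-calculated per stub
+                                 const MethodTable* type) // known only at call time
+    {
+        // Mix the pre-calculated token hash with the call-time type.
+        size_t index = (tokenHash ^ ((uintptr_t)type >> 4)) & (CACHE_SIZE - 1);
+        const CacheEntry& e = g_resolveCache[index];
+        if (e.token == token && e.type == type)
+            return e.target;                  // cache hit: jump to the target
+        return GenericResolver(token, type);  // miss: fall back and populate
+    }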
+
+### Code Sequences
+
+The former interface virtual table dispatch mechanism results in a code sequence similar to this:
+
+![Figure 3](../images/virtualstubdispatch-fig3.png)
+
+And the typical stub dispatch sequence is:
+
+![Figure 4](../images/virtualstubdispatch-fig4.png)
+
+where expectedMT, failure and target are constants encoded in the stub.
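+
+For readers without the figures at hand, the two sequences can be approximated as follows. This is a rough reconstruction rendered as comments; the exact registers, offsets, and instruction forms are assumptions, not the actual generated code:
+
+    // Former vtable-based interface dispatch (sketch): dependent loads.
+    //   mov eax, [ecx]             ; load the object's MethodTable
+    //   mov eax, [eax + mapOfs]    ; load the interface's slot table
+    //   mov eax, [eax + slot*4]    ; load the target from the table
+    //   call eax
+    //
+    // Stub dispatch (sketch): one load, a compare, then a direct jump.
+    //   mov eax, [ecx]             ; load the object's MethodTable
+    //   cmp eax, expectedMT        ; compare with the stub's cached type
+    //   jne failure                ; predicted not-taken on success
+    //   jmp target                 ; jump to the cached implementation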
+
+The typical stub sequence has the same number of instructions as the former interface dispatch mechanism, and its fewer memory indirections may allow it to execute faster with a smaller working set contribution. It also results in smaller JITed code, since the bulk of the work is in the stub instead of the call site; this is only advantageous if a call site is rarely invoked. Note that the failure branch is arranged so that x86 branch prediction will follow the success case.
+
+Current State
+=============
+
+Currently, VSD is enabled only for interface method calls, not for virtual instance method calls. There were two main reasons for this:
+
+- **Startup:** Startup working set and speed were hindered by the need to generate a large number of initial stubs.
+- **Throughput:** While interface dispatches are generally faster with VSD, virtual instance method calls suffer an unacceptable speed degradation.
+
+As a result of disabling VSD for virtual instance method calls, every type has a vtable for its virtual instance methods, and the implementation table described above is not used. Dispatch maps are still present to enable interface method dispatching.
+
+Physical Architecture
+=====================
+
+For dispatch token and map implementation details, please see [src/vm/contractimpl.h](https://github.com/dotnet/coreclr/blob/master/src/vm/contractimpl.h) and [src/vm/contractimpl.cpp](https://github.com/dotnet/coreclr/blob/master/src/vm/contractimpl.cpp).
+
+For virtual stub dispatch implementation details, please see [src/vm/virtualcallstub.h](https://github.com/dotnet/coreclr/blob/master/src/vm/virtualcallstub.h) and [src/vm/virtualcallstub.cpp](https://github.com/dotnet/coreclr/blob/master/src/vm/virtualcallstub.cpp).