Age | Commit message | Author | Files | Lines |
|
This change removes CPU groups emulation from Unix PAL and modifies the
GC and thread pool code accordingly.
|
|
- Remove concept of AppDomain from object api in VM
- Various infrastructure around entering/leaving appdomains is removed
- Add small implementation of GetAppDomain for use by DAC (to match existing behavior)
- Simplify finalizer thread operations
- Eliminate AppDomain::Terminate
- Remove use of ADID from stresslog
- Remove thread enter/leave tracking from AppDomain
- Remove unused asm constants across all architectures
- Re-order header inclusion order to put gcenv.h before handletable
- Remove retail only sync block code involving appdomain index
|
|
* Remove dead ContainToAppDomain
* Respond to feedback
|
|
Add Serialization Guard API and consume it in CoreLib targets
|
|
* start ripping out eventpipe buffer to tls
* can now emit events from gc threads
* cleanup
* more cleanup
* more cleanup
* tested on linux
* Addressing PR comments
* Move things around a bit to build in Linux
* change eventpipe buffer deallocation code
* more cleanup
* this while loop doesnt do anything now
* Fix build
* fixing build
* More cleanup
* more pr comments
* Fix unix build
* more pr comments
* trying to add a message to assertion that seems to be causing CIs to fail
* more pr feedback
* handle non-2-byte aligned string payloads inside payload buffers
* some more cleanup
* Fix off by one error in null index calculation
* Make Get/SetThreadEventBufferList a static member of ThreadEventBufferList
* make only the methods public in ThreadEventBufferList
* Addressing noah's comments
* fix comment and last off by 1 error
|
|
A large portion of the current culture handling in the unmanaged runtime, inherited from desktop, has been a no-op. The nativeInitCultureAccessors QCall that it depended on in desktop was (almost) never called in CoreCLR.
- Delete resetting of current culture on threadpool threads. It was needed in desktop because of a very tricky flow of current culture between appdomains. It is superseded by flowing the current culture via AsyncLocal in CoreCLR.
- Comment out fetch of managed current culture for unmanaged resource lookup. It has a number of problems that are not easy to fix. We are not localizing the unmanaged runtime currently anyway, so it is ok to just comment it out.
- Fix the rest to call CultureInfo directly without going through Thread.CurrentThread
|
|
* Delete vm/context.*
Leftover from remoting
|
|
* Remove IsNeutralDomain()
* PR feedback
|
|
* Fix LoaderAllocator::AllocateHandle
When another thread wins the race in growing the handle table, the code
was not refreshing the slotsUsed local to the new up to date value. This
was leading to overwriting / reusing a live handle.
This change fixes it.
* Embed ThreadLocalBlock in Thread
Instead of allocating ThreadLocalBlock dynamically, embed it in the
Thread. That solves race issue between thread destruction and
LoaderAllocator destruction. The ThreadLocalBlock could have been
deleted during Thread shutdown while the LoaderAllocator's destruction
would be working with it.
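The first fix above can be illustrated with a minimal sketch. This is not the actual LoaderAllocator code: `HandleTable`, `slots_` and `slotsUsed_` are hypothetical names, and the table is assumed to grow by building a larger array outside the lock. The point is only that the local copy of the slot counter is re-read after the lock is re-acquired, whether or not this thread won the growth race.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical, simplified stand-in for a growable handle table.
// Not the real LoaderAllocator code.
class HandleTable {
public:
    explicit HandleTable(std::size_t capacity) : slots_(capacity, nullptr) {}

    // Stores `value` in the next unused slot and returns that slot's index.
    std::size_t Allocate(void* value) {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            // Re-read the shared counter on every pass. A copy taken before
            // the lock was dropped may now name a slot another thread filled.
            std::size_t slotsUsed = slotsUsed_;
            if (slotsUsed < slots_.size()) {
                slots_[slotsUsed] = value;
                slotsUsed_ = slotsUsed + 1;
                return slotsUsed;
            }

            // Table is full. Build the larger array without holding the lock.
            const std::size_t oldCapacity = slots_.size();
            lock.unlock();
            std::vector<void*> bigger(std::max<std::size_t>(1, oldCapacity * 2), nullptr);
            lock.lock();

            if (slots_.size() == oldCapacity) {
                // This thread won the race: copy live handles and install.
                std::copy(slots_.begin(), slots_.end(), bigger.begin());
                slots_.swap(bigger);
            }
            // Win or lose, loop so slotsUsed is refreshed before it is used.
        }
    }

    void* Get(std::size_t index) {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_[index];
    }

private:
    std::mutex mutex_;
    std::vector<void*> slots_;
    std::size_t slotsUsed_ = 0;
};
```

The bug described above corresponds to keeping the value of `slotsUsed` from before `lock.unlock()` and writing to that index after losing the race.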
|
|
* Change GetAppDomain to return it from the global static
The current implementation of the GetAppDomain takes it from the TLS for
the current thread. But we only have one AppDomain in the system, so we
can change it to return just that one.
I have still left the ThreadLocalInfo.m_pAppDomain and its setter
present, because SOS uses that to access the AppDomain and the SOS needs
to be runtime version agnostic.
This makes it perform better on Unix, where accessing TLS is not
trivial.
* Move the AppDomain instance pointer to own static
To enable access to the one and only AppDomain without unnecessary
indirections, I have moved the pointer out of the SystemDomain class.
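A rough sketch of the idea, assuming a single process-wide AppDomain. The names `g_pTheAppDomain`, `t_pAppDomain` and `GetAppDomainViaTls` are illustrative only and are not taken from the runtime sources.

```cpp
#include <cassert>

// Hypothetical stand-in type; the real AppDomain is far larger.
struct AppDomain {
    int id;
};

// Before: each thread carried its own pointer in thread-local storage.
// The slot is kept here because, per the message above, external tools
// may still read it.
thread_local AppDomain* t_pAppDomain = nullptr;

// After: one static pointer for the one and only AppDomain.
AppDomain* g_pTheAppDomain = nullptr;

// Old-style accessor: pays for a TLS lookup and depends on per-thread setup.
inline AppDomain* GetAppDomainViaTls() { return t_pAppDomain; }

// New-style accessor: a plain load from a global, no TLS access needed.
inline AppDomain* GetAppDomain() { return g_pTheAppDomain; }

// Setter retained for the per-thread slot.
inline void SetAppDomain(AppDomain* domain) { t_pAppDomain = domain; }
```

A thread that has never called `SetAppDomain` still gets the right answer from `GetAppDomain`, which is what makes the TLS lookup unnecessary when there is only one domain.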
|
|
nits on Windows (#20730)
* Remove implicit c-string const casting and clean up some C++ standards conformance bugs.
* Fix const string conversion in FCSigCheck.
|
|
Since there is only one AppDomain, there is no need for a per-AppDomain
TLB table for each Thread. This change removes that table and thus gets
rid of the extra indirection needed to access the TLB.
|
|
* Enable thread statics for collectible classes
This change removes checks that were preventing usage of thread statics
in collectible classes and also implements all the necessary changes.
The handles that hold arrays with thread statics are allocated from
LoaderAllocator for collectible classes instead of using the global
strong handle like in the case of non-collectible classes.
The change very much mimics what is done for regular statics.
This change also adds ability to reuse freed handles to the
LoaderAllocator handle table. Freed handle indexes are stored into a
stack and when a new handle allocation is requested, the indices from
this stack are used first.
Due to the code path from which the FreeTLM that in turn frees the
handles is called, I had to modify the critical section flags and also
refactor the handle allocation so that the actual managed array
representing the handle table is allocated out of the critical section.
When I was touching the code, I have also moved the code that was
dealing with handles that are not stored in the LoaderAllocator handle
tables out of the critical section, since there is no point in having it
inside of it.
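The handle reuse scheme described above (freed indexes pushed onto a stack and consumed first on the next allocation) can be sketched as follows. This is a simplified, single-threaded illustration: `ReusableHandleTable` and its members are hypothetical names, and the real table is a managed array guarded by a critical section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a handle table that recycles freed indexes.
class ReusableHandleTable {
public:
    // Returns the index of the slot that now holds `value`.
    std::size_t Allocate(void* value) {
        if (!freeIndexes_.empty()) {
            // Reuse the most recently freed index before growing the table.
            const std::size_t index = freeIndexes_.back();
            freeIndexes_.pop_back();
            slots_[index] = value;
            return index;
        }
        slots_.push_back(value);
        return slots_.size() - 1;
    }

    // Clears the slot and remembers its index for later reuse.
    void Free(std::size_t index) {
        slots_[index] = nullptr;
        freeIndexes_.push_back(index);
    }

    void* Get(std::size_t index) const { return slots_[index]; }
    std::size_t Capacity() const { return slots_.size(); }

private:
    std::vector<void*> slots_;
    std::vector<std::size_t> freeIndexes_;  // used as a stack
};
```

With this shape, repeatedly creating and destroying thread statics does not grow the table as long as earlier handles have been freed.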
|
|
* Remove AppDomain unload
This change removes all code in AppDomain that's related to AppDomain
unloading which is obsolete in CoreCLR. It also removes all calls to the
removed methods.
In a few places, I have made the change simpler by taking into account the
fact that there is always just one AppDomain.
|
|
This bug fix is a port from the equivalent fix in framework. The
debugger tried performing a stackwalk in the epilog due to the JIT
incorrectly reporting epilogue information. This caused an invalid
GS cookie to be checked and caused the debugger to crash. A flag was
added to allow debug stackwalks to skip the cookie check.
|
|
Fix GCStress assertion
|
|
There was already some support for labeling threads using the Windows SetThreadDescription API; however, it was missing some important cases (like labeling the ThreadPool and the GC server and background threads). Fix this. Also make the naming consistent (they all start with .NET).
These names show up in PerfView traces and can be used by debuggers or other profilers as well.
|
|
This change addresses races that cause spurious failures when running
GC stress on multithreaded applications.
* Instruction update race
Threads that hit a gc cover interrupt where gc is not safe can race to
overwrite the interrupt instruction and change it back to the original
instruction.
This can cause confusion when handling stress exceptions as the exception code
raised by the kernel may be determined by disassembling the instruction that
caused the fault, and this instruction may now change between the time the
fault is raised and the instruction is disassembled. When this happens the
kernel may report an ACCESS_VIOLATION where there was actually an attempt to
execute a privileged instruction.
x86 already had a tolerance mechanism here where when gc stress was active
and the exception status was ACCESS_VIOLATION the faulting instruction would
be retried to see if it faults the same way again. In this change we extend
this tolerance to cover x64 and also enable it regardless of the gc mode.
We use the exception information to further screen as these spurious AVs look
like reads from address 0xFF..FF.
* Instrumentation vs execution race
The second race happens when one thread is jitting a method and another is
about to call the method. The first thread finishes jitting and publishes the
method code, then starts instrumenting the method for gc coverage. While this
instrumentation is ongoing, the second thread then calls the method and hits
a gc interrupt instruction. The code that recognizes the fault as a gc coverage
interrupt gets confused as the instrumentation is not yet complete -- in
particular the m_GcCover member of the MethodDesc is not yet set. So the second
thread triggers an assert.
The fix for this is to instrument for GcCoverage before publishing the code.
Since multiple threads can be jitting a method concurrently the instrument and
publish steps are done under a lock to ensure that the instrumentation and code
are consistent (come from the same thread).
With this lock in place we have removed the secondary locking done in
SetupGcCoverage as it is no longer needed; only one thread can be instrumenting
a given jitted method for GcCoverage.
However we retain a bailout clause that first looks to see if m_GcCover is
set and if so skips instrumentation, as there are prejit and rejit cases where we
will retry instrumentation.
* Instruction cache flushes
In some cases when replacing the interrupt instruction with the original the
instruction cache was either not flushed or not flushed with sufficient length.
This possibly leads to an increased frequency of the above races.
No impact expected for non-gc stress scenarios, though some of the code changes
are in common code paths.
Addresses the spurious GC stress failures seen in #17027 and #17610.
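The ordering fix for the instrumentation vs execution race can be sketched roughly as below. Everything here is a simplified assumption: the `MethodDesc` struct, `publishLock`, `gcCover`, `code` and `InstrumentAndPublish` are hypothetical stand-ins, and the real instrumentation rewrites instructions rather than storing a pointer.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for the runtime structures named above.
struct GcCoverageInfo {
    int instrumentedPoints;
};

struct MethodDesc {
    std::mutex publishLock;               // guards instrument + publish
    GcCoverageInfo* gcCover = nullptr;    // set before the code is visible
    const unsigned char* code = nullptr;  // non-null once published
};

// Instruments (if needed) and then publishes `jittedCode`, all under one
// lock. Returns whichever code ended up published for the method.
const unsigned char* InstrumentAndPublish(MethodDesc& md,
                                          const unsigned char* jittedCode,
                                          GcCoverageInfo* info) {
    std::lock_guard<std::mutex> hold(md.publishLock);

    if (md.code != nullptr) {
        // Another thread jitted and published first; keep its code and its
        // instrumentation so the two stay consistent.
        return md.code;
    }

    // Bailout: skip instrumentation when it has already been done.
    if (md.gcCover == nullptr) {
        md.gcCover = info;
    }

    // Publish only after the coverage data is in place, so a caller that
    // hits an interrupt instruction always finds gcCover set.
    md.code = jittedCode;
    return md.code;
}
```

The invariant a reader of `code` can rely on is that a non-null `code` implies a non-null `gcCover`.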
|
|
Eliminate `FEATURE_UNIX_AMD64_STRUCT_PASSING` and replace it with `UNIX_AMD64_ABI` when used alone. Both are currently defined; it is highly unlikely the latter will work alone; and it significantly clutters up the code, especially the JIT.
Also, fix the altjit support (now `UNIX_AMD64_ABI_ITF`) to *not* call `ClassifyEightBytes` if the struct is too large. Otherwise it asserts.
|
|
Fixed mixed mode attach/JIT debugging.
The mixed mode debugging attach uses a TLS slot to communicate between the debugger break-in thread and the right side. Unfortunately, __thread static variables cannot be used on the debugger break-in thread because it does not have storage allocated for them.
The fix is to switch the storage for the debugger word to a classic TlsAlloc-allocated slot, which works fine on the debugger break-in thread.
There was also a problem (also present in 2.0) where WINNT_OFFSETOF__TEB__ThreadLocalStoragePointer was using the define for 64/32 bit and always ended up with the 32-bit Windows value. This caused the right side GetEEThreadValue and GetEETlsDataBlock unmanaged thread functions to always fail.
|
|
Part of fix for https://github.com/dotnet/coreclr/issues/10441
|
|
Move YieldProcessorNormalized into separate files
Clean up YieldProcessorNormalized
|
|
Linux and Windows arm64 are using the regular C/C++ thread local statics. This change unifies the remaining Windows architectures to be on the same plan.
|
|
Move initialization of YieldProcessorNormalized to the finalizer thread
Fixes https://github.com/dotnet/coreclr/issues/13984
- Also moved relevant functions out of the Thread class as requested in the issue
- For some reason, after moving the functions out of the Thread class, YieldProcessorNormalized was not getting inlined anymore. It seems to be important to have it be inlined such that the memory loads are hoisted out of outer loops. To remove the dependency on the compiler to do it (even with forceinline it's not possible to hoist sometimes, for instance InterlockedCompareExchange loops), changed the signatures to do what is intended.
|
|
CrstStatic (#13857)
Fixes https://github.com/dotnet/coreclr/issues/13779
|
|
* Add normalized equivalent of YieldProcessor, retune some spin loops
Part of fix for https://github.com/dotnet/coreclr/issues/13388
Normalized equivalent of YieldProcessor
- The delay incurred by YieldProcessor is measured once lazily at run-time
- Added YieldProcessorNormalized that yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration.
- Changed Thread.SpinWait to use YieldProcessorNormalized
Thread.SpinWait divide count by 7 experiment
- At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it fares on perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized. See numbers in PR below.
- Not too many regressions, and the overall perf is somewhat as expected - not much change on Sandy Bridge processor, significant improvement on Skylake processor.
- I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
- ReaderWriterLockSlim seems to perform worse on Skylake, the current spin heuristics are not translating well
Spin tuning
- At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
- General observations
- YieldProcessor stage
- At this stage in many places we're currently doing very long spins on YieldProcessor per iteration of the spin loop. In the last YieldProcessor iteration, it amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
- Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
- Long spins on YieldProcessor are in some cases helping to reduce contention in high-contention cases, effectively taking away some threads into a long delay. Sleep(1) works much better but has a much higher delay so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
- Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (this was tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
- Yield/Sleep(0) stage
- Observed a couple of issues:
- When there are no threads to switch to, Yield and Sleep(0) become no-ops, which turns the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switching if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
- If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
- I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor, it enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold.
- Sleep(1) stage
- I didn't see any benefit in the tests to interleave Sleep(1) calls with some Yield/Sleep(0) calls, perf seemed to be a bit worse actually. If the Sleep(1) stage is reached, there is probably a lot of contention and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in-between seems to add back some of that contention.
- Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
- For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
- For spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available), they benefit from not doing Sleep(1) at all, and spinning in other stages for longer
- For infinite spin loops, they usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors like how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects.
- Added an internal overload of SpinWait.SpinOnce to take the Sleep(1) threshold as a parameter
- SpinWait - Tweaked the spin strategy as mentioned above
- ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (total delay is still significantly less than before). Retained the previous behavior of having Sleep(1) if a higher spin count is requested.
- Task - It was using the same heuristics as ManualResetEventSlim, copied the changes here as well
- SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
- SpinLock - SpinLock was using very long YieldProcessor spins. Changed to use SpinWait, removed process count multiplier, simplified.
- ReaderWriterLockSlim - This one is complicated as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.
- The perf (see numbers in PR below) seems to be much better than both the baseline and the Thread.SpinWait divide by 7 experiment
- On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better and not the other way around.
- On Skylake, some significant regressions are in SemaphoreSlim throughput (which I'm discounting as I mentioned above in the experiment) and CountdownEvent add/signal throughput. The latter can probably be improved later.
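The normalization step described at the top of this message (measure the cost of one yield lazily, then work out how many yields add up to a fixed target delay) might look roughly like the sketch below. It is not the runtime's implementation: it works in nanoseconds rather than cycles, `kTargetNsPerNormalizedYield` is an arbitrary illustrative value, and `PlatformYield` uses `std::this_thread::yield()` only as a portable stand-in for the processor's pause instruction.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Portable stand-in for YieldProcessor (a pause instruction in practice).
inline void PlatformYield() { std::this_thread::yield(); }

// Assumed target delay for one normalized yield. Illustrative value only.
constexpr double kTargetNsPerNormalizedYield = 40.0;

// Pure arithmetic: how many raw yields approximate the target delay.
// Always returns at least 1 so a normalized yield never becomes a no-op.
inline unsigned ComputeYieldsPerNormalizedYield(double nsPerYield, double targetNs) {
    if (nsPerYield <= 0.0) {
        return 1;
    }
    const unsigned rounded = static_cast<unsigned>(targetNs / nsPerYield + 0.5);
    return rounded == 0 ? 1 : rounded;
}

// Times a batch of raw yields and returns the average cost of one.
inline double MeasureNsPerYield(unsigned iterations = 1000) {
    if (iterations == 0) {
        return 0.0;
    }
    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i) {
        PlatformYield();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(end - start).count() / iterations;
}

// Measured once, lazily, on first use (function-local static).
inline unsigned YieldsPerNormalizedYield() {
    static const unsigned count = ComputeYieldsPerNormalizedYield(
        MeasureNsPerYield(), kTargetNsPerNormalizedYield);
    return count;
}

// Yields for roughly `count` normalized units regardless of processor.
inline void YieldProcessorNormalized(unsigned count = 1) {
    const unsigned total = count * YieldsPerNormalizedYield();
    for (unsigned i = 0; i < total; ++i) {
        PlatformYield();
    }
}
```

With a raw yield that is seven times shorter than the target, `ComputeYieldsPerNormalizedYield` returns 7, which matches the shape of the Sandy Bridge observation above (7 * YieldProcessor == YieldProcessorNormalized).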
|
|
* Added SetThreadDescription to set the unmanaged thread name as well when a managed thread name was set.
This will show up in future debuggers which know how to read that information or in ETW traces in the Thread Name column.
* use printf instead of wprintf which exists on all platforms.
* Removed printf
Ensure that GetProcAddress is only called once when the method is not present.
Potential perf hit should be negligible since setting a thread name can only happen once per managed thread.
* - Moved SetThreadName code to winfix.cpp as proposed
- Finalizer and threadpool threads get their name
- GCToEEInterface::CreateBackgroundThread is also named
- but regular GC threads have no name because when I included utilcode.h things did break apart.
* Fix for data race in g_pfnSetThreadDescription
* Fix string literals on unix builds.
* Fixed nits
Settled thread name on ".NET Core ThreadPool"
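The "look up the optional API only once, even when it is missing" point can be sketched in portable C++ as below. This is an assumption-laden illustration, not the code in winfix.cpp: `LookupStandIn` is a hypothetical replacement for the GetProcAddress call and always reports the API as absent, `g_lookupCalls` exists only to make the behavior observable, and the function pointer signature is simplified.

```cpp
#include <cassert>

// Simplified, hypothetical signature for the optional OS entry point.
using SetThreadDescriptionFn = long (*)(void* thread, const wchar_t* name);

// Counts lookups so the "only once" behavior can be observed.
static int g_lookupCalls = 0;

// Stand-in for the dynamic lookup. Returns nullptr to simulate an OS
// version where the API does not exist.
inline SetThreadDescriptionFn LookupStandIn() {
    ++g_lookupCalls;
    return nullptr;
}

// Tries to set the native thread name. Returns false when the API is
// unavailable or the call fails.
inline bool TrySetNativeThreadName(const wchar_t* name) {
    // A function-local static is initialized exactly once, and C++11
    // guarantees that initialization is thread safe. The nullptr result is
    // cached too, so a missing API is not looked up again.
    static const SetThreadDescriptionFn pfn = LookupStandIn();
    if (pfn == nullptr) {
        return false;
    }
    return pfn(nullptr, name) >= 0;
}
```

Caching the negative result is the important part: without it, every thread naming attempt on an older OS would repeat the failed lookup.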
|
|
* Fix build errors when TRACK_SYNC is defined
* Remove unnecessary default constructor
|
|
|