author     Andy Ayers <andya@microsoft.com>    2018-04-19 10:17:11 -0700
committer  GitHub <noreply@github.com>         2018-04-19 10:17:11 -0700
commit     571b1a7c84aa264afe6a33bd58eca8c9c10052ff (patch)
tree       238ff06b6a076af59bd9e1a28f115c6c41a32adf /src/vm/i386
parent     204c11b8309bf243b9255ddf7f17820cd320cf4d (diff)
GCStress: try to reduce races and tolerate races better (#17330)
This change addresses races that cause spurious failures when running GC stress on multithreaded applications.

* Instruction update race

Threads that hit a gc cover interrupt where gc is not safe can race to overwrite the interrupt instruction and change it back to the original instruction. This can cause confusion when handling stress exceptions, as the exception code raised by the kernel may be determined by disassembling the instruction that caused the fault, and that instruction may now change between the time the fault is raised and the time it is disassembled. When this happens the kernel may report an ACCESS_VIOLATION where there was actually an attempt to execute a privileged instruction.

x86 already had a tolerance mechanism here: when gc stress was active and the exception status was ACCESS_VIOLATION, the faulting instruction would be retried to see if it faults the same way again. This change extends that tolerance to cover x64 and also enables it regardless of the gc mode. We use the exception information to screen further, as these spurious AVs look like reads from address 0xFF..FF.

* Instrumentation vs execution race

The second race happens when one thread is jitting a method and another is about to call the method. The first thread finishes jitting and publishes the method code, then starts instrumenting the method for gc coverage. While this instrumentation is ongoing, the second thread calls the method and hits a gc interrupt instruction. The code that recognizes the fault as a gc coverage interrupt gets confused because the instrumentation is not yet complete -- in particular, the m_GcCover member of the MethodDesc is not yet set -- so the second thread triggers an assert.

The fix is to instrument for GC coverage before publishing the code. Since multiple threads can be jitting a method concurrently, the instrument and publish steps are done under a lock to ensure that the instrumentation and code are consistent (come from the same thread). With this lock in place the secondary locking done in SetupGcCoverage has been removed, as it is no longer needed; only one thread can be instrumenting a given jitted method for GC coverage. However, we retain a bailout clause that first checks whether m_GcCover is set and, if so, skips instrumentation, as there are prejit and rejit cases where instrumentation is retried.

* Instruction cache flushes

In some cases, when replacing the interrupt instruction with the original, the instruction cache was either not flushed or not flushed with sufficient length. This possibly leads to an increased frequency of the above races.

No impact is expected for non-gc-stress scenarios, though some of the code changes are in common code paths.

Addresses the spurious GC stress failures seen in #17027 and #17610.
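As an aside, here is a minimal sketch of the instrument-then-publish ordering described above, assuming stand-in names (MethodDesc is reduced to a dummy struct, and SetupGcCoverage, PublishMethodCode, and g_jitPublishLock are placeholders, not the actual CoreCLR declarations):

#include <mutex>

// Illustration only -- not the CoreCLR implementation.
struct MethodDesc { void* m_GcCover = nullptr; };   // stand-in for the real type

static void SetupGcCoverage(MethodDesc* pMD, void* /*pCode*/)
{
    pMD->m_GcCover = pMD;             // placeholder: real code allocates coverage info
}

static void PublishMethodCode(MethodDesc* /*pMD*/, void* /*pCode*/)
{
    // placeholder: make the native code reachable by callers
}

static std::mutex g_jitPublishLock;   // assumed lock covering instrument + publish

void InstrumentAndPublish(MethodDesc* pMD, void* pCode)
{
    // Both steps run under one lock, so the published code and its gc
    // coverage instrumentation always come from the same jitting thread.
    std::lock_guard<std::mutex> hold(g_jitPublishLock);

    // Bailout: prejit/rejit retries may find the method already instrumented.
    if (pMD->m_GcCover == nullptr)
        SetupGcCoverage(pMD, pCode);

    PublishMethodCode(pMD, pCode);    // only now can another thread call pCode
}

With this ordering a caller can never observe published code whose m_GcCover is still unset, which is exactly the window the assert was tripping on.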
Diffstat (limited to 'src/vm/i386')
-rw-r--r--  src/vm/i386/excepx86.cpp  9
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/src/vm/i386/excepx86.cpp b/src/vm/i386/excepx86.cpp
index 9f19d47440..9f558c457e 100644
--- a/src/vm/i386/excepx86.cpp
+++ b/src/vm/i386/excepx86.cpp
@@ -1347,7 +1347,7 @@ CPFH_FirstPassHandler(EXCEPTION_RECORD *pExceptionRecord,
// Check to see if this exception is due to GCStress. Since the GCStress mechanism only injects these faults
// into managed code, we only need to check for them in CPFH_FirstPassHandler.
//
- if (IsGcMarker(exceptionCode, pContext))
+ if (IsGcMarker(pContext, pExceptionRecord))
{
return ExceptionContinueExecution;
}
@@ -1675,7 +1675,7 @@ EXCEPTION_HANDLER_IMPL(COMPlusFrameHandler)
// it is very easy to trash the last error. For example, a p/invoke called a native method
// which sets last error. Before we getting the last error in the IL stub, it is trashed here
DWORD dwLastError = GetLastError();
- fIsGCMarker = IsGcMarker(pExceptionRecord->ExceptionCode, pContext);
+ fIsGCMarker = IsGcMarker(pContext, pExceptionRecord);
if (!fIsGCMarker)
{
SaveCurrentExceptionInfo(pExceptionRecord, pContext);
@@ -3693,12 +3693,11 @@ AdjustContextForVirtualStub(
pExceptionRecord->ExceptionAddress = (PVOID)callsite;
SetIP(pContext, callsite);
-#ifdef HAVE_GCCOVER
+#if defined(GCCOVER_TOLERATE_SPURIOUS_AV)
// Modify LastAVAddress saved in thread to distinguish between fake & real AV
// See comments in IsGcMarker in file excep.cpp for more details
pThread->SetLastAVAddress((LPVOID)GetIP(pContext));
-#endif
-
+#endif // defined(GCCOVER_TOLERATE_SPURIOUS_AV)
// put ESP back to what it was before the call.
SetSP(pContext, dac_cast<PCODE>(dac_cast<PTR_BYTE>(GetSP(pContext)) + sizeof(void*)));
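For reference, the hunk above replaces the HAVE_GCCOVER guard with a dedicated GCCOVER_TOLERATE_SPURIOUS_AV define and keeps the thread's last-AV address current so IsGcMarker can tell a fake (gc stress) AV from a real one. Below is a minimal sketch of the screening described in the commit message, assuming a hypothetical helper name (SpuriousGcStressAvCandidate is not a real runtime function):

#include <windows.h>

// Illustration only -- not the CoreCLR implementation. Screens an exception
// the way the commit message describes: the spurious AVs produced by the
// instruction-update race look like reads from an all-ones address.
static bool SpuriousGcStressAvCandidate(const EXCEPTION_RECORD* pRecord)
{
    if (pRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
        return false;

    const ULONG_PTR accessType   = pRecord->ExceptionInformation[0];  // 0 == read
    const ULONG_PTR faultAddress = pRecord->ExceptionInformation[1];
    return accessType == 0 && faultAddress == ~(ULONG_PTR)0;
}

A caller in the position of IsGcMarker could combine this screen with the recorded last-AV address and retry the faulting instruction to see whether the fault reproduces before treating it as a real access violation; the retry itself is omitted here.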