path: root/src/vm/gccover.cpp
author     Andy Ayers <andya@microsoft.com>   2018-04-19 10:17:11 -0700
committer  GitHub <noreply@github.com>        2018-04-19 10:17:11 -0700
commit     571b1a7c84aa264afe6a33bd58eca8c9c10052ff (patch)
tree       238ff06b6a076af59bd9e1a28f115c6c41a32adf /src/vm/gccover.cpp
parent     204c11b8309bf243b9255ddf7f17820cd320cf4d (diff)
GCStress: try to reduce races and tolerate races better (#17330)
This change addresses races that cause spurious failures when running GC stress on multithreaded applications.

* Instruction update race

Threads that hit a gc cover interrupt where gc is not safe can race to overwrite the interrupt instruction and change it back to the original instruction. This can cause confusion when handling stress exceptions, as the exception code raised by the kernel may be determined by disassembling the instruction that caused the fault, and this instruction may change between the time the fault is raised and the time the instruction is disassembled. When this happens the kernel may report an ACCESS_VIOLATION where there was actually an attempt to execute a privileged instruction.

x86 already had a tolerance mechanism here: when gc stress was active and the exception status was ACCESS_VIOLATION, the faulting instruction would be retried to see if it faults the same way again. This change extends that tolerance to cover x64 and also enables it regardless of the gc mode. We use the exception information to screen further, as these spurious AVs look like reads from address 0xFF..FF (see the sketch after this message).

* Instrumentation vs execution race

The second race happens when one thread is jitting a method and another is about to call it. The first thread finishes jitting and publishes the method code, then starts instrumenting the method for gc coverage. While this instrumentation is ongoing, the second thread calls the method and hits a gc interrupt instruction. The code that recognizes the fault as a gc coverage interrupt gets confused because the instrumentation is not yet complete -- in particular the m_GcCover member of the MethodDesc is not yet set -- so the second thread triggers an assert.

The fix is to instrument for GC coverage before publishing the code. Since multiple threads can be jitting a method concurrently, the instrument and publish steps are done under a lock to ensure that the instrumentation and code are consistent (come from the same thread). With this lock in place we have removed the secondary locking done in SetupGcCoverage, as it is no longer needed; only one thread can be instrumenting a given jitted method for GC coverage. However, we retain a bailout clause that first checks whether m_GcCover is set and, if so, skips instrumentation, as there are prejit and rejit cases where instrumentation will be retried.

* Instruction cache flushes

In some cases when replacing the interrupt instruction with the original, the instruction cache was either not flushed or not flushed with sufficient length. This possibly leads to an increased frequency of the above races.

No impact expected for non-gc stress scenarios, though some of the code changes are in common code paths.

Addresses the spurious GC stress failures seen in #17027 and #17610.
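As a rough illustration of the screening idea described in the first bullet, the check amounts to treating an access violation whose reported fault address is all-ones as a candidate spurious fault worth retrying. This is a sketch only, not the actual coreclr code; the helper name and the exact filter are assumptions.

#include <windows.h>

// Hypothetical helper (illustrative, not the coreclr implementation): decide
// whether an access violation raised while GC stress is active looks like the
// spurious fault produced by the instruction-update race described above.
static bool LooksLikeSpuriousGcCoverAV(const EXCEPTION_RECORD* pRecord)
{
    if (pRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION)
        return false;

    // For an access violation, ExceptionInformation[0] is the read/write flag
    // (0 = read) and ExceptionInformation[1] is the faulting address. The
    // spurious faults surface as reads from an all-ones address (0xFF..FF).
    const ULONG_PTR readWriteFlag = pRecord->ExceptionInformation[0];
    const ULONG_PTR faultAddress  = pRecord->ExceptionInformation[1];
    return readWriteFlag == 0 && faultAddress == (ULONG_PTR)-1;
}

If the check passes, the handler can simply resume at the same instruction pointer: either the original instruction has already been restored and the retry succeeds, or the fault reproduces and is reported normally.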
Diffstat (limited to 'src/vm/gccover.cpp')
-rw-r--r--  src/vm/gccover.cpp  85
1 file changed, 25 insertions, 60 deletions
diff --git a/src/vm/gccover.cpp b/src/vm/gccover.cpp
index d61a168f47..ca91687887 100644
--- a/src/vm/gccover.cpp
+++ b/src/vm/gccover.cpp
@@ -144,69 +144,31 @@ void SetupGcCoverage(MethodDesc* pMD, BYTE* methodStartPtr) {
}
#endif
- if (pMD->m_GcCover)
- return;
-
+ // Ideally we would assert here that m_GcCover is NULL.
+ //
+ // However, we can't do that (at least not yet), because we may
+ // invoke this method more than once on a given
+ // MethodDesc. Examples include prejitted methods and rejitted
+ // methods.
//
- // In the gcstress=4 case, we can easily piggy-back onto the JITLock because we
- // have a JIT operation that needs to take that lock already. But in the case of
- // gcstress=8, we cannot do this because the code already exists, and if gccoverage
- // were not in the picture, we're happy to race to do the prestub work because all
- // threads end up with the same answer and don't leak any resources in the process.
- //
- // However, with gccoverage, we need to exclude all other threads from mucking with
- // the code while we fill in the breakpoints and make our shadow copy of the code.
+ // In the prejit case, we can't safely re-instrument an already
+ // instrumented method. By bailing out here, we will use the
+ // original instrumentation, which should still be valid as
+ // the method code has not changed.
//
+ // In the rejit case, the old method code may still be active and
+ // instrumented, so we need to preserve that gc cover info. By
+ // bailing out here we will skip instrumenting the rejitted native
+ // code, and since the rejitted method does not get instrumented
+ // we should be able to tolerate that the gc cover info does not
+ // match.
+ if (pMD->m_GcCover)
{
- BaseDomain* pDomain = pMD->GetDomain();
- // Enter the global lock which protects the list of all functions being JITd
- JitListLock::LockHolder pJitLock(pDomain->GetJitLock());
-
-
- // It is possible that another thread stepped in before we entered the global lock for the first time.
- if (pMD->m_GcCover)
- {
- // We came in to jit but someone beat us so return the jitted method!
- return;
- }
- else
- {
- const char *description = "jit lock (gc cover)";
-#ifdef _DEBUG
- description = pMD->m_pszDebugMethodName;
-#endif
- ReleaseHolder<JitListLockEntry> pEntry(JitListLockEntry::Find(pJitLock, pMD->GetInitialCodeVersion(), description));
-
- // We have an entry now, we can release the global lock
- pJitLock.Release();
-
- // Take the entry lock
- {
- JitListLockEntry::LockHolder pEntryLock(pEntry, FALSE);
-
- if (pEntryLock.DeadlockAwareAcquire())
- {
- // we have the lock...
- }
- else
- {
- // Note that at this point we don't have the lock, but that's OK because the
- // thread which does have the lock is blocked waiting for us.
- }
-
- if (pMD->m_GcCover)
- {
- return;
- }
-
- PCODE codeStart = (PCODE) methodStartPtr;
-
- SetupAndSprinkleBreakpointsForJittedMethod(pMD,
- codeStart
- );
- }
- }
+ return;
}
+
+ PCODE codeStart = (PCODE) methodStartPtr;
+ SetupAndSprinkleBreakpointsForJittedMethod(pMD, codeStart);
}
#ifdef FEATURE_PREJIT
@@ -1305,6 +1267,8 @@ void RemoveGcCoverageInterrupt(TADDR instrPtr, BYTE * savedInstrPtr)
#else
*(BYTE *)instrPtr = *savedInstrPtr;
#endif
+
+ FlushInstructionCache(GetCurrentProcess(), (LPCVOID)instrPtr, 4);
}
BOOL OnGcCoverageInterrupt(PCONTEXT regs)
@@ -1677,7 +1641,8 @@ void DoGcStress (PCONTEXT regs, MethodDesc *pMD)
}
// Must flush instruction cache before returning as instruction has been modified.
- FlushInstructionCache(GetCurrentProcess(), (LPCVOID)instrPtr, 6);
+ // Note this needs to reach beyond the call by up to 4 bytes.
+ FlushInstructionCache(GetCurrentProcess(), (LPCVOID)instrPtr, 10);
// It's not GC safe point, the GC Stress instruction is
// already commited and interrupt is already put at next instruction so we just return.
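For context on the cache-flush changes in the last two hunks: the pattern is to restore the saved instruction bytes and then flush a range large enough to cover everything that was patched. Below is a minimal sketch assuming Win32 APIs; the function and parameter names are illustrative, not the exact coreclr code.

#include <windows.h>
#include <cstring>

// Illustrative only: restore the original instruction bytes over a GC coverage
// interrupt and flush the instruction cache for the full patched range, so no
// thread keeps executing the stale interrupt instruction.
static void RestoreAndFlush(BYTE* instrPtr, const BYTE* savedBytes, SIZE_T patchLen)
{
    memcpy(instrPtr, savedBytes, patchLen);

    // Flushing fewer bytes than were patched (or not flushing at all) can leave
    // the old bytes visible to other threads and widen the race windows above.
    FlushInstructionCache(GetCurrentProcess(), instrPtr, patchLen);
}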