author    Koundinya Veluri <kouvel@users.noreply.github.com>    2017-09-01 13:09:40 -0700
committer GitHub <noreply@github.com>    2017-09-01 13:09:40 -0700
commit    03bf95c8db9003a5925ca9383dca722a4c651e27 (patch)
tree      5a0087d03ba2dcb4f319a9a104a9f76702fdd82c /src
parent    12db0a3ccf42ab21333872cc3984009aecd06eeb (diff)
Add normalized equivalent of YieldProcessor, retune some spin loops (#13670)
* Add normalized equivalent of YieldProcessor, retune some spin loops

Part of the fix for https://github.com/dotnet/coreclr/issues/13388

Normalized equivalent of YieldProcessor
- The delay incurred by YieldProcessor is measured once lazily at run-time
- Added YieldProcessorNormalized, which yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration.
- Changed Thread.SpinWait to use YieldProcessorNormalized

Thread.SpinWait divide-count-by-7 experiment
- At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it fares on perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized. See numbers in the PR below.
  - Not too many regressions, and the overall perf is somewhat as expected - not much change on the Sandy Bridge processor, significant improvement on the Skylake processor.
  - I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
  - ReaderWriterLockSlim seems to perform worse on Skylake; the current spin heuristics are not translating well.

Spin tuning
- At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
- General observations
  - YieldProcessor stage
    - In many places we're currently doing very long spins on YieldProcessor per iteration of the spin loop. In the last YieldProcessor iteration, it amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
    - Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
    - Long spins on YieldProcessor do in some cases help to reduce contention in high-contention cases, effectively taking away some threads into a long delay. Sleep(1) works much better but has a much higher delay, so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
    - Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
  - Yield/Sleep(0) stage
    - Observed a couple of issues:
      - When there are no threads to switch to, Yield and Sleep(0) become no-ops, turning the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switching if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
      - If multiple threads are doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
    - I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor; it enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold.
  - Sleep(1) stage
    - I didn't see any benefit in the tests to interleaving Sleep(1) calls with some Yield/Sleep(0) calls; perf actually seemed a bit worse. If the Sleep(1) stage is reached, there is probably a lot of contention, and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in between seems to add back some of that contention.
    - Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
    - For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
      - Spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available) benefit from not doing Sleep(1) at all and spinning in the other stages for longer
      - Infinite spin loops usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors, such as how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects
    - Added an internal overload of SpinWait.SpinOnce that takes the Sleep(1) threshold as a parameter
- SpinWait - Tweaked the spin strategy as mentioned above
- ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (the total delay is still significantly less than before). Retained the previous behavior of doing Sleep(1) if a higher spin count is requested.
- Task - It was using the same heuristics as ManualResetEventSlim; copied the changes here as well
- SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
- SpinLock - SpinLock was using very long YieldProcessor spins. Changed to use SpinWait, removed the processor count multiplier, simplified.
- ReaderWriterLockSlim - This one is complicated, as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.

The perf (see numbers in the PR below) seems to be much better than both the baseline and the Thread.SpinWait divide-by-7 experiment.
- On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better, not the other way around.
- On Skylake, the significant regressions are in SemaphoreSlim throughput (which I'm discounting, as mentioned in the experiment above) and CountdownEvent add/signal throughput. The latter can probably be improved later.
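The retuned SpinWait strategy described above - a capped busy-spin stage, Yield/Sleep(0) interleaved with short spins, and Sleep(1) past a threshold - can be sketched as a pure decision function. This is an illustrative C++ translation, not the managed implementation; the thresholds mirror the constants in the commit, while kOptimalMaxSpinsPerIteration stands in for the lazily measured per-processor value (7 is the post-Skylake-like figure mentioned in the commit message).

```cpp
#include <string>

// Illustrative constants mirroring the SpinWait tuning in this commit.
const int kYieldThreshold = 10;             // spin count at which yielding begins
const int kSleep0EveryHowManyYields = 5;    // every 5th yield becomes Sleep(0)
const int kDefaultSleep1Threshold = 20;     // spin count at which Sleep(1) takes over
const int kOptimalMaxSpinsPerIteration = 7; // hypothetical measured value

// Returns the action one SpinOnce call would take for a given spin count on a
// multi-processor machine (a single-processor machine always yields).
std::string SpinAction(int count, int sleep1Threshold = kDefaultSleep1Threshold)
{
    bool yieldStage =
        count >= kYieldThreshold &&
        (count >= sleep1Threshold || (count - kYieldThreshold) % 2 == 0);
    if (yieldStage)
    {
        if (count >= sleep1Threshold)
            return "Sleep(1)";  // past the threshold, only Sleep(1)
        int yieldsSoFar = (count - kYieldThreshold) / 2;  // odd offsets busy-spin instead
        if (yieldsSoFar % kSleep0EveryHowManyYields == kSleep0EveryHowManyYields - 1)
            return "Sleep(0)";
        return "Yield";
    }
    // Busy-spin stage: exponential backoff capped at the normalized optimum,
    // so no single iteration wastes many thousands of cycles in YieldProcessor.
    int n = kOptimalMaxSpinsPerIteration;
    if (count <= 30 && (1 << count) < n)
        n = 1 << count;
    return "SpinWait(" + std::to_string(n) + ")";
}
```

For counts between the two thresholds, even offsets yield and odd offsets fall through to a short capped spin, which is the interleaving the commit message describes.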
Diffstat (limited to 'src')
-rw-r--r--  src/mscorlib/shared/System/Threading/SpinWait.cs             |  98
-rw-r--r--  src/mscorlib/src/Internal/Runtime/Augments/RuntimeThread.cs  |  29
-rw-r--r--  src/mscorlib/src/System/Threading/ManualResetEventSlim.cs    |  43
-rw-r--r--  src/mscorlib/src/System/Threading/SemaphoreSlim.cs           |  19
-rw-r--r--  src/mscorlib/src/System/Threading/SpinLock.cs                |  68
-rw-r--r--  src/mscorlib/src/System/Threading/Tasks/Task.cs              |  32
-rw-r--r--  src/vm/comsynchronizable.cpp                                 |  34
-rw-r--r--  src/vm/comsynchronizable.h                                   |   1
-rw-r--r--  src/vm/ecalllist.h                                           |   1
-rw-r--r--  src/vm/threads.cpp                                           |  84
-rw-r--r--  src/vm/threads.h                                             |  64
11 files changed, 325 insertions, 148 deletions
diff --git a/src/mscorlib/shared/System/Threading/SpinWait.cs b/src/mscorlib/shared/System/Threading/SpinWait.cs
index d25d54f26f..5346e8d17b 100644
--- a/src/mscorlib/shared/System/Threading/SpinWait.cs
+++ b/src/mscorlib/shared/System/Threading/SpinWait.cs
@@ -69,9 +69,26 @@ namespace System.Threading
// numbers may seem fairly arbitrary, but were derived with at least some
// thought in the design document. I fully expect they will need to change
// over time as we gain more experience with performance.
- internal const int YIELD_THRESHOLD = 10; // When to switch over to a true yield.
- internal const int SLEEP_0_EVERY_HOW_MANY_TIMES = 5; // After how many yields should we Sleep(0)?
- internal const int SLEEP_1_EVERY_HOW_MANY_TIMES = 20; // After how many yields should we Sleep(1)?
+ internal const int YieldThreshold = 10; // When to switch over to a true yield.
+ private const int Sleep0EveryHowManyYields = 5; // After how many yields should we Sleep(0)?
+ internal const int DefaultSleep1Threshold = 20; // After how many yields should we Sleep(1) frequently?
+
+ /// <summary>
+ /// A suggested number of spin iterations before doing a proper wait, such as waiting on an event that becomes signaled
+ /// when the resource becomes available.
+ /// </summary>
+ /// <remarks>
+ /// These numbers were arrived at by experimenting with different numbers in various cases that currently use it. It's
+ /// only a suggested value and typically works well when the proper wait is something like an event.
+ ///
+ /// Spinning less can lead to early waiting and more context switching, spinning more can decrease latency but may use
+ /// up some CPU time unnecessarily. Depends on the situation too, for instance SemaphoreSlim uses double this number
+ /// because the waiting there is currently a lot more expensive (involves more spinning, taking a lock, etc.). It also
+ /// depends on the likelihood of the spin being successful and how long the wait would be but those are not accounted
+ /// for here.
+ /// </remarks>
+ internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
+ internal const int Sleep1ThresholdForSpinBeforeWait = 40; // should be greater than SpinCountforSpinBeforeWait
// The number of times we've spun already.
private int _count;
@@ -81,7 +98,12 @@ namespace System.Threading
/// </summary>
public int Count
{
- get { return _count; }
+ get => _count;
+ internal set
+ {
+ Debug.Assert(value >= 0);
+ _count = value;
+ }
}
/// <summary>
@@ -94,10 +116,7 @@ namespace System.Threading
/// On a single-CPU machine, <see cref="SpinOnce"/> always yields the processor. On machines with
/// multiple CPUs, <see cref="SpinOnce"/> may yield after an unspecified number of calls.
/// </remarks>
- public bool NextSpinWillYield
- {
- get { return _count > YIELD_THRESHOLD || PlatformHelper.IsSingleProcessor; }
- }
+ public bool NextSpinWillYield => _count >= YieldThreshold || PlatformHelper.IsSingleProcessor;
/// <summary>
/// Performs a single spin.
@@ -108,7 +127,27 @@ namespace System.Threading
/// </remarks>
public void SpinOnce()
{
- if (NextSpinWillYield)
+ SpinOnce(DefaultSleep1Threshold);
+ }
+
+ internal void SpinOnce(int sleep1Threshold)
+ {
+ Debug.Assert(sleep1Threshold >= YieldThreshold || PlatformHelper.IsSingleProcessor); // so that NextSpinWillYield behaves as requested
+
+ // (_count - YieldThreshold) % 2 == 0: The purpose of this check is to interleave Thread.Yield/Sleep(0) with
+ // Thread.SpinWait. Otherwise, the following issues occur:
+ // - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a
+ // busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may
+ // just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause
+ // excessive context switching if a wait follows, and entering the Sleep(1) stage too early can cause
+ // excessive delays.
+ // - If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to
+ // contention), they may switch between one another, delaying work that can make progress.
+ if ((
+ _count >= YieldThreshold &&
+ (_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0)
+ ) ||
+ PlatformHelper.IsSingleProcessor)
{
//
// We must yield.
@@ -125,19 +164,21 @@ namespace System.Threading
// configured to use the (default) coarse-grained system timer.
//
- int yieldsSoFar = (_count >= YIELD_THRESHOLD ? _count - YIELD_THRESHOLD : _count);
-
- if ((yieldsSoFar % SLEEP_1_EVERY_HOW_MANY_TIMES) == (SLEEP_1_EVERY_HOW_MANY_TIMES - 1))
+ if (_count >= sleep1Threshold)
{
RuntimeThread.Sleep(1);
}
- else if ((yieldsSoFar % SLEEP_0_EVERY_HOW_MANY_TIMES) == (SLEEP_0_EVERY_HOW_MANY_TIMES - 1))
- {
- RuntimeThread.Sleep(0);
- }
else
{
- RuntimeThread.Yield();
+ int yieldsSoFar = _count >= YieldThreshold ? (_count - YieldThreshold) / 2 : _count;
+ if ((yieldsSoFar % Sleep0EveryHowManyYields) == (Sleep0EveryHowManyYields - 1))
+ {
+ RuntimeThread.Sleep(0);
+ }
+ else
+ {
+ RuntimeThread.Yield();
+ }
}
}
else
@@ -153,11 +194,24 @@ namespace System.Threading
// number of spins we are willing to tolerate to reduce delay to the caller,
// since we expect most callers will eventually block anyway.
//
- RuntimeThread.SpinWait(4 << _count);
+ // Also, cap the maximum spin count to a value such that many thousands of CPU cycles would not be wasted doing
+ // the equivalent of YieldProcessor(), as at that point SwitchToThread/Sleep(0) are more likely to be able to
+ // allow other useful work to run. Long YieldProcessor() loops can help to reduce contention, but Sleep(1) is
+ // usually better for that.
+ //
+ // RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration:
+ // - See Thread::InitializeYieldProcessorNormalized(), which describes and calculates this value.
+ //
+ int n = RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration;
+ if (_count <= 30 && (1 << _count) < n)
+ {
+ n = 1 << _count;
+ }
+ RuntimeThread.SpinWait(n);
}
// Finally, increment our spin counter.
- _count = (_count == int.MaxValue ? YIELD_THRESHOLD : _count + 1);
+ _count = (_count == int.MaxValue ? YieldThreshold : _count + 1);
}
/// <summary>
@@ -299,9 +353,7 @@ namespace System.Threading
/// <summary>
/// Gets whether the current machine has only a single processor.
/// </summary>
- internal static bool IsSingleProcessor
- {
- get { return ProcessorCount == 1; }
- }
+ /// <remarks>This typically does not change on a machine, so it's checked only once.</remarks>
+ internal static readonly bool IsSingleProcessor = ProcessorCount == 1;
}
}
diff --git a/src/mscorlib/src/Internal/Runtime/Augments/RuntimeThread.cs b/src/mscorlib/src/Internal/Runtime/Augments/RuntimeThread.cs
index 605f974da0..4c67ea3fd6 100644
--- a/src/mscorlib/src/Internal/Runtime/Augments/RuntimeThread.cs
+++ b/src/mscorlib/src/Internal/Runtime/Augments/RuntimeThread.cs
@@ -15,6 +15,8 @@ namespace Internal.Runtime.Augments
{
public class RuntimeThread : CriticalFinalizerObject
{
+ private static int s_optimalMaxSpinWaitsPerSpinIteration;
+
internal RuntimeThread() { }
public static RuntimeThread Create(ThreadStart start) => new Thread(start);
@@ -186,6 +188,33 @@ namespace Internal.Runtime.Augments
private extern bool JoinInternal(int millisecondsTimeout);
public static void Sleep(int millisecondsTimeout) => Thread.Sleep(millisecondsTimeout);
+
+ [DllImport(JitHelpers.QCall)]
+ [SuppressUnmanagedCodeSecurity]
+ private static extern int GetOptimalMaxSpinWaitsPerSpinIterationInternal();
+
+ /// <summary>
+ /// Max value to be passed into <see cref="SpinWait(int)"/> for optimal delaying. This value is normalized to be
+ /// appropriate for the processor.
+ /// </summary>
+ internal static int OptimalMaxSpinWaitsPerSpinIteration
+ {
+ get
+ {
+ if (s_optimalMaxSpinWaitsPerSpinIteration != 0)
+ {
+ return s_optimalMaxSpinWaitsPerSpinIteration;
+ }
+
+ // This is done lazily because the first call to the function below in the process triggers a measurement that
+ // takes a nontrivial amount of time. See Thread::InitializeYieldProcessorNormalized(), which describes and
+ // calculates this value.
+ s_optimalMaxSpinWaitsPerSpinIteration = GetOptimalMaxSpinWaitsPerSpinIterationInternal();
+ Debug.Assert(s_optimalMaxSpinWaitsPerSpinIteration > 0);
+ return s_optimalMaxSpinWaitsPerSpinIteration;
+ }
+ }
+
public static void SpinWait(int iterations) => Thread.SpinWait(iterations);
public static bool Yield() => Thread.Yield();
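The lazy caching in OptimalMaxSpinWaitsPerSpinIteration above relies on the measurement being idempotent: a race between first callers is benign because every thread computes and stores the same value, and 0 is reserved as the "not yet measured" sentinel. A minimal C++ sketch of the same pattern (names and the measured value are hypothetical):

```cpp
#include <atomic>
#include <cassert>

// 0 means "not yet measured"; any positive value is the cached result.
static std::atomic<int> s_cached{0};

static int ExpensiveMeasurement()
{
    // Stand-in for the QCall that runs the timing measurement in the VM.
    return 7;  // hypothetical measured value
}

int OptimalMaxSpinWaits()
{
    int v = s_cached.load(std::memory_order_relaxed);
    if (v != 0)
        return v;  // fast path after first use

    // Racing threads may all get here; they store the same value, so the
    // duplicated work is harmless and no lock is needed.
    v = ExpensiveMeasurement();
    assert(v > 0);  // must never store the sentinel
    s_cached.store(v, std::memory_order_relaxed);
    return v;
}
```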
diff --git a/src/mscorlib/src/System/Threading/ManualResetEventSlim.cs b/src/mscorlib/src/System/Threading/ManualResetEventSlim.cs
index e396968499..8a245f0602 100644
--- a/src/mscorlib/src/System/Threading/ManualResetEventSlim.cs
+++ b/src/mscorlib/src/System/Threading/ManualResetEventSlim.cs
@@ -12,9 +12,6 @@
//
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-using System;
-using System.Threading;
-using System.Runtime.InteropServices;
using System.Diagnostics;
using System.Diagnostics.Contracts;
@@ -48,7 +45,6 @@ namespace System.Threading
{
// These are the default spin counts we use on single-proc and MP machines.
private const int DEFAULT_SPIN_SP = 1;
- private const int DEFAULT_SPIN_MP = SpinWait.YIELD_THRESHOLD;
private volatile object m_lock;
// A lock used for waiting and pulsing. Lazily initialized via EnsureLockObjectCreated()
@@ -193,7 +189,7 @@ namespace System.Threading
{
// Specify the default spin count, and use default spin if we're
// on a multi-processor machine. Otherwise, we won't.
- Initialize(initialState, DEFAULT_SPIN_MP);
+ Initialize(initialState, SpinWait.SpinCountforSpinBeforeWait);
}
/// <summary>
@@ -563,44 +559,19 @@ namespace System.Threading
bNeedTimeoutAdjustment = true;
}
- //spin
- int HOW_MANY_SPIN_BEFORE_YIELD = 10;
- int HOW_MANY_YIELD_EVERY_SLEEP_0 = 5;
- int HOW_MANY_YIELD_EVERY_SLEEP_1 = 20;
-
+ // Spin
int spinCount = SpinCount;
- for (int i = 0; i < spinCount; i++)
+ var spinner = new SpinWait();
+ while (spinner.Count < spinCount)
{
+ spinner.SpinOnce(SpinWait.Sleep1ThresholdForSpinBeforeWait);
+
if (IsSet)
{
return true;
}
- else if (i < HOW_MANY_SPIN_BEFORE_YIELD)
- {
- if (i == HOW_MANY_SPIN_BEFORE_YIELD / 2)
- {
- Thread.Yield();
- }
- else
- {
- Thread.SpinWait(4 << i);
- }
- }
- else if (i % HOW_MANY_YIELD_EVERY_SLEEP_1 == 0)
- {
- Thread.Sleep(1);
- }
- else if (i % HOW_MANY_YIELD_EVERY_SLEEP_0 == 0)
- {
- Thread.Sleep(0);
- }
- else
- {
- Thread.Yield();
- }
-
- if (i >= 100 && i % 10 == 0) // check the cancellation token if the user passed a very large spin count
+ if (spinner.Count >= 100 && spinner.Count % 10 == 0) // check the cancellation token if the user passed a very large spin count
cancellationToken.ThrowIfCancellationRequested();
}
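The reworked wait loop above follows a common spin-then-block shape: poll the flag for a bounded number of spin iterations, then fall back to a proper wait. A self-contained C++ sketch of that shape (not the ManualResetEventSlim implementation; the constant and names are illustrative):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Illustrative spin-then-wait event: spin a bounded number of iterations
// checking the flag, then fall back to a condition-variable wait.
struct SlimEvent
{
    std::atomic<bool> isSet{false};
    std::mutex m;
    std::condition_variable cv;

    void Set()
    {
        isSet.store(true);
        std::lock_guard<std::mutex> g(m);  // taken so a concurrent waiter can't miss the notify
        cv.notify_all();
    }

    void Wait(int spinCount = 35)  // 35 mirrors the multi-proc spin count in the diff
    {
        for (int i = 0; i < spinCount; ++i)
        {
            if (isSet.load())
                return;  // spin succeeded; no context switch was needed
            // The managed code calls SpinWait.SpinOnce(...) here to pace the loop.
        }
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return isSet.load(); });
    }
};
```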
diff --git a/src/mscorlib/src/System/Threading/SemaphoreSlim.cs b/src/mscorlib/src/System/Threading/SemaphoreSlim.cs
index e9f5e633f8..3776edd47a 100644
--- a/src/mscorlib/src/System/Threading/SemaphoreSlim.cs
+++ b/src/mscorlib/src/System/Threading/SemaphoreSlim.cs
@@ -342,15 +342,28 @@ namespace System.Threading
CancellationTokenRegistration cancellationTokenRegistration = cancellationToken.InternalRegisterWithoutEC(s_cancellationTokenCanceledEventHandler, this);
try
{
- // Perf: first spin wait for the count to be positive, but only up to the first planned yield.
+ // Perf: first spin wait for the count to be positive.
// This additional amount of spinwaiting in addition
// to Monitor.Enter()’s spinwaiting has shown measurable perf gains in test scenarios.
//
+
+ // Monitor.Enter followed by Monitor.Wait is much more expensive than waiting on an event as it involves another
+ // spin, contention, etc. The usual number of spin iterations that would otherwise be used here is doubled to
+ // lessen that extra expense of doing a proper wait.
+ int spinCount = SpinWait.SpinCountforSpinBeforeWait * 2;
+ int sleep1Threshold = SpinWait.Sleep1ThresholdForSpinBeforeWait * 2;
+
SpinWait spin = new SpinWait();
- while (m_currentCount == 0 && !spin.NextSpinWillYield)
+ while (true)
{
- spin.SpinOnce();
+ spin.SpinOnce(sleep1Threshold);
+
+ if (m_currentCount != 0)
+ {
+ break;
+ }
}
+
// entering the lock and incrementing waiters must not suffer a thread-abort, else we cannot
// clean up m_waitCount correctly, which may lead to deadlock due to non-woken waiters.
try { }
diff --git a/src/mscorlib/src/System/Threading/SpinLock.cs b/src/mscorlib/src/System/Threading/SpinLock.cs
index eee73ce2bf..dbf2024e5d 100644
--- a/src/mscorlib/src/System/Threading/SpinLock.cs
+++ b/src/mscorlib/src/System/Threading/SpinLock.cs
@@ -65,16 +65,9 @@ namespace System.Threading
private volatile int m_owner;
- // The multiplier factor for the each spinning iteration
- // This number has been chosen after trying different numbers on different CPUs (4, 8 and 16 ) and this provided the best results
- private const int SPINNING_FACTOR = 100;
-
// After how many yields, call Sleep(1)
private const int SLEEP_ONE_FREQUENCY = 40;
- // After how many yields, call Sleep(0)
- private const int SLEEP_ZERO_FREQUENCY = 10;
-
// After how many yields, check the timeout
private const int TIMEOUT_CHECK_FREQUENCY = 10;
@@ -347,48 +340,24 @@ namespace System.Threading
else // failed to acquire the lock, then try to update the waiters. If the waiters count reached the maximum, just break the loop to avoid overflow
{
if ((observedOwner & WAITERS_MASK) != MAXIMUM_WAITERS)
+ {
+ // This can still overflow, but maybe there will never be that many waiters
turn = (Interlocked.Add(ref m_owner, 2) & WAITERS_MASK) >> 1;
+ }
}
- //***Step 2. Spinning
//lock acquired failed and waiters updated
- int processorCount = PlatformHelper.ProcessorCount;
- if (turn < processorCount)
- {
- int processFactor = 1;
- for (int i = 1; i <= turn * SPINNING_FACTOR; i++)
- {
- Thread.SpinWait((turn + i) * SPINNING_FACTOR * processFactor);
- if (processFactor < processorCount)
- processFactor++;
- observedOwner = m_owner;
- if ((observedOwner & LOCK_ANONYMOUS_OWNED) == LOCK_UNOWNED)
- {
- int newOwner = (observedOwner & WAITERS_MASK) == 0 ? // Gets the number of waiters, if zero
- observedOwner | 1 // don't decrement it. just set the lock bit, it is zzero because a previous call of Exit(false) ehich corrupted the waiters
- : (observedOwner - 2) | 1; // otherwise decrement the waiters and set the lock bit
- Debug.Assert((newOwner & WAITERS_MASK) >= 0);
-
- if (CompareExchange(ref m_owner, newOwner, observedOwner, ref lockTaken) == observedOwner)
- {
- return;
- }
- }
- }
- // Check the timeout.
- if (millisecondsTimeout != Timeout.Infinite && TimeoutHelper.UpdateTimeOut(startTime, millisecondsTimeout) <= 0)
- {
- DecrementWaiters();
- return;
- }
+ //*** Step 2, Spinning and Yielding
+ var spinner = new SpinWait();
+ if (turn > PlatformHelper.ProcessorCount)
+ {
+ spinner.Count = SpinWait.YieldThreshold;
}
-
- //*** Step 3, Yielding
- //Sleep(1) every 50 yields
- int yieldsoFar = 0;
while (true)
{
+ spinner.SpinOnce(SLEEP_ONE_FREQUENCY);
+
observedOwner = m_owner;
if ((observedOwner & LOCK_ANONYMOUS_OWNED) == LOCK_UNOWNED)
{
@@ -403,20 +372,7 @@ namespace System.Threading
}
}
- if (yieldsoFar % SLEEP_ONE_FREQUENCY == 0)
- {
- Thread.Sleep(1);
- }
- else if (yieldsoFar % SLEEP_ZERO_FREQUENCY == 0)
- {
- Thread.Sleep(0);
- }
- else
- {
- Thread.Yield();
- }
-
- if (yieldsoFar % TIMEOUT_CHECK_FREQUENCY == 0)
+ if (spinner.Count % TIMEOUT_CHECK_FREQUENCY == 0)
{
//Check the timeout.
if (millisecondsTimeout != Timeout.Infinite && TimeoutHelper.UpdateTimeOut(startTime, millisecondsTimeout) <= 0)
@@ -425,8 +381,6 @@ namespace System.Threading
return;
}
}
-
- yieldsoFar++;
}
}
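The waiter bookkeeping above packs everything into one word: bit 0 is the lock bit and the higher bits count waiters, which is why a waiter registers with Interlocked.Add(ref m_owner, 2) and derives its "turn" from the resulting waiter field. A small C++ sketch of that encoding (the mask and names are illustrative, not the exact coreclr constants):

```cpp
#include <atomic>

// Illustrative layout of the owner word: bit 0 is the lock bit, the higher
// bits hold the waiter count, so one waiter == 2 in the raw value.
const int kLockBit = 1;
const int kWaitersMask = 0x7FFFFFFE;

// A waiter registers by adding 2 and derives its "turn" from the new waiter
// count. Note: Interlocked.Add returns the new value; fetch_add returns the
// old one, hence the + 2.
int RegisterWaiter(std::atomic<int>& owner)
{
    int newValue = owner.fetch_add(2) + 2;
    return (newValue & kWaitersMask) >> 1;
}
```

Because registration only touches the waiter field, the lock bit (kLockBit) is left for the acquire/release path to toggle with a compare-exchange.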
diff --git a/src/mscorlib/src/System/Threading/Tasks/Task.cs b/src/mscorlib/src/System/Threading/Tasks/Task.cs
index 25fea588c1..84cb0fdc4f 100644
--- a/src/mscorlib/src/System/Threading/Tasks/Task.cs
+++ b/src/mscorlib/src/System/Threading/Tasks/Task.cs
@@ -10,19 +10,14 @@
//
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
-using System.Runtime;
-using System.Runtime.CompilerServices;
-using System.Runtime.InteropServices;
-using System.Runtime.ExceptionServices;
-using System.Security;
-using System.Threading;
using System.Diagnostics;
using System.Diagnostics.Contracts;
-using Microsoft.Win32;
using System.Diagnostics.Tracing;
+using System.Runtime.CompilerServices;
+using System.Runtime.ExceptionServices;
+using Internal.Runtime.Augments;
// Disable the "reference to volatile field not treated as volatile" error.
#pragma warning disable 0420
@@ -2971,26 +2966,19 @@ namespace System.Threading.Tasks
return false;
}
- //This code is pretty similar to the custom spinning in MRES except there is no yieling after we exceed the spin count
- int spinCount = PlatformHelper.IsSingleProcessor ? 1 : System.Threading.SpinWait.YIELD_THRESHOLD; //spin only once if we are running on a single CPU
- for (int i = 0; i < spinCount; i++)
+ int spinCount = Threading.SpinWait.SpinCountforSpinBeforeWait;
+ var spinner = new SpinWait();
+ while (spinner.Count < spinCount)
{
+ spinner.SpinOnce(Threading.SpinWait.Sleep1ThresholdForSpinBeforeWait);
+
if (IsCompleted)
{
return true;
}
-
- if (i == spinCount / 2)
- {
- Thread.Yield();
- }
- else
- {
- Thread.SpinWait(4 << i);
- }
}
- return IsCompleted;
+ return false;
}
/// <summary>
@@ -3227,7 +3215,7 @@ namespace System.Threading.Tasks
// Skip synchronous execution of continuations if this task's thread was aborted
bool bCanInlineContinuations = !(((m_stateFlags & TASK_STATE_THREAD_WAS_ABORTED) != 0) ||
- (Thread.CurrentThread.ThreadState == ThreadState.AbortRequested) ||
+ (RuntimeThread.CurrentThread.ThreadState == ThreadState.AbortRequested) ||
((m_stateFlags & (int)TaskCreationOptions.RunContinuationsAsynchronously) != 0));
// Handle the single-Action case
diff --git a/src/vm/comsynchronizable.cpp b/src/vm/comsynchronizable.cpp
index 0554fe3385..8fce346142 100644
--- a/src/vm/comsynchronizable.cpp
+++ b/src/vm/comsynchronizable.cpp
@@ -1624,22 +1624,41 @@ FCIMPL1(FC_BOOL_RET, ThreadNative::IsThreadpoolThread, ThreadBaseObject* thread)
}
FCIMPLEND
+INT32 QCALLTYPE ThreadNative::GetOptimalMaxSpinWaitsPerSpinIteration()
+{
+ QCALL_CONTRACT;
+
+ INT32 optimalMaxNormalizedYieldsPerSpinIteration;
+
+ BEGIN_QCALL;
+
+ Thread::EnsureYieldProcessorNormalizedInitialized();
+ optimalMaxNormalizedYieldsPerSpinIteration = Thread::GetOptimalMaxNormalizedYieldsPerSpinIteration();
+
+ END_QCALL;
+
+ return optimalMaxNormalizedYieldsPerSpinIteration;
+}
FCIMPL1(void, ThreadNative::SpinWait, int iterations)
{
FCALL_CONTRACT;
+ if (iterations <= 0)
+ {
+ return;
+ }
+
//
// If we're not going to spin for long, it's ok to remain in cooperative mode.
// The threshold is determined by the cost of entering preemptive mode; if we're
// spinning for less than that number of cycles, then switching to preemptive
- // mode won't help a GC start any faster. That number is right around 1000000
- // on my machine.
+ // mode won't help a GC start any faster.
//
- if (iterations <= 1000000)
+ if (iterations <= 100000 && Thread::IsYieldProcessorNormalizedInitialized())
{
- for(int i = 0; i < iterations; i++)
- YieldProcessor();
+ for (int i = 0; i < iterations; i++)
+ Thread::YieldProcessorNormalized();
return;
}
@@ -1649,8 +1668,9 @@ FCIMPL1(void, ThreadNative::SpinWait, int iterations)
HELPER_METHOD_FRAME_BEGIN_NOPOLL();
GCX_PREEMP();
- for(int i = 0; i < iterations; i++)
- YieldProcessor();
+ Thread::EnsureYieldProcessorNormalizedInitialized();
+ for (int i = 0; i < iterations; i++)
+ Thread::YieldProcessorNormalized();
HELPER_METHOD_FRAME_END();
}
diff --git a/src/vm/comsynchronizable.h b/src/vm/comsynchronizable.h
index 00b055c960..b280c605b8 100644
--- a/src/vm/comsynchronizable.h
+++ b/src/vm/comsynchronizable.h
@@ -97,6 +97,7 @@ public:
UINT64 QCALLTYPE GetProcessDefaultStackSize();
static FCDECL1(INT32, GetManagedThreadId, ThreadBaseObject* th);
+ static INT32 QCALLTYPE GetOptimalMaxSpinWaitsPerSpinIteration();
static FCDECL1(void, SpinWait, int iterations);
static BOOL QCALLTYPE YieldThread();
static FCDECL0(Object*, GetCurrentThread);
diff --git a/src/vm/ecalllist.h b/src/vm/ecalllist.h
index 214e190cc7..76be0b172c 100644
--- a/src/vm/ecalllist.h
+++ b/src/vm/ecalllist.h
@@ -710,6 +710,7 @@ FCFuncStart(gRuntimeThreadFuncs)
#endif // FEATURE_COMINTEROP
FCFuncElement("InterruptInternal", ThreadNative::Interrupt)
FCFuncElement("JoinInternal", ThreadNative::Join)
+ QCFuncElement("GetOptimalMaxSpinWaitsPerSpinIterationInternal", ThreadNative::GetOptimalMaxSpinWaitsPerSpinIteration)
FCFuncEnd()
FCFuncStart(gThreadFuncs)
diff --git a/src/vm/threads.cpp b/src/vm/threads.cpp
index b827140dd4..abc544338b 100644
--- a/src/vm/threads.cpp
+++ b/src/vm/threads.cpp
@@ -11744,3 +11744,87 @@ ULONGLONG Thread::QueryThreadProcessorUsage()
return ullCurrentUsage - ullPreviousUsage;
}
#endif // FEATURE_APPDOMAIN_RESOURCE_MONITORING
+
+int Thread::s_yieldsPerNormalizedYield = 0;
+int Thread::s_optimalMaxNormalizedYieldsPerSpinIteration = 0;
+
+static Crst s_initializeYieldProcessorNormalizedCrst(CrstLeafLock);
+void Thread::InitializeYieldProcessorNormalized()
+{
+ LIMITED_METHOD_CONTRACT;
+
+ CrstHolder lock(&s_initializeYieldProcessorNormalizedCrst);
+
+ if (IsYieldProcessorNormalizedInitialized())
+ {
+ return;
+ }
+
+ // Intel pre-Skylake processor: measured typically 14-17 cycles per yield
+ // Intel post-Skylake processor: measured typically 125-150 cycles per yield
+ const int DefaultYieldsPerNormalizedYield = 1; // defaults are for when no measurement is done
+ const int DefaultOptimalMaxNormalizedYieldsPerSpinIteration = 64; // tuned for pre-Skylake processors, for post-Skylake it should be 7
+ const int MeasureDurationMs = 10;
+ const int MaxYieldsPerNormalizedYield = 10; // measured typically 8-9 on pre-Skylake
+ const int MinNsPerNormalizedYield = 37; // measured typically 37-46 on post-Skylake
+ const int NsPerOptimialMaxSpinIterationDuration = 272; // approx. 900 cycles, measured 281 on pre-Skylake, 263 on post-Skylake
+ const int NsPerSecond = 1000 * 1000 * 1000;
+
+ LARGE_INTEGER li;
+ if (!QueryPerformanceFrequency(&li) || (ULONGLONG)li.QuadPart < 1000 / MeasureDurationMs)
+ {
+ // High precision clock not available or clock resolution is too low, resort to defaults
+ s_yieldsPerNormalizedYield = DefaultYieldsPerNormalizedYield;
+ s_optimalMaxNormalizedYieldsPerSpinIteration = DefaultOptimalMaxNormalizedYieldsPerSpinIteration;
+ return;
+ }
+ ULONGLONG ticksPerSecond = li.QuadPart;
+
+ // Measure the nanosecond delay per yield
+ ULONGLONG measureDurationTicks = ticksPerSecond / (1000 / MeasureDurationMs);
+ unsigned int yieldCount = 0;
+ QueryPerformanceCounter(&li);
+ ULONGLONG startTicks = li.QuadPart;
+ ULONGLONG elapsedTicks;
+ do
+ {
+ for (int i = 0; i < 10; ++i)
+ {
+ YieldProcessor();
+ }
+ yieldCount += 10;
+
+ QueryPerformanceCounter(&li);
+ ULONGLONG nowTicks = li.QuadPart;
+ elapsedTicks = nowTicks - startTicks;
+ } while (elapsedTicks < measureDurationTicks);
+ double nsPerYield = (double)elapsedTicks * NsPerSecond / ((double)yieldCount * ticksPerSecond);
+ if (nsPerYield < 1)
+ {
+ nsPerYield = 1;
+ }
+
+ // Calculate the number of yields required to span the duration of a normalized yield
+ int yieldsPerNormalizedYield = (int)(MinNsPerNormalizedYield / nsPerYield + 0.5);
+ if (yieldsPerNormalizedYield < 1)
+ {
+ yieldsPerNormalizedYield = 1;
+ }
+ else if (yieldsPerNormalizedYield > MaxYieldsPerNormalizedYield)
+ {
+ yieldsPerNormalizedYield = MaxYieldsPerNormalizedYield;
+ }
+
+ // Calculate the maximum number of yields that would be optimal for a late spin iteration. Typically, we would not want to
+ // spend excessive amounts of time (thousands of cycles) doing only YieldProcessor, as SwitchToThread/Sleep would do a
+ // better job of allowing other work to run.
+ int optimalMaxNormalizedYieldsPerSpinIteration =
+ (int)(NsPerOptimialMaxSpinIterationDuration / (yieldsPerNormalizedYield * nsPerYield) + 0.5);
+ if (optimalMaxNormalizedYieldsPerSpinIteration < 1)
+ {
+ optimalMaxNormalizedYieldsPerSpinIteration = 1;
+ }
+
+ s_yieldsPerNormalizedYield = yieldsPerNormalizedYield;
+ s_optimalMaxNormalizedYieldsPerSpinIteration = optimalMaxNormalizedYieldsPerSpinIteration;
+}
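The normalization in InitializeYieldProcessorNormalized above boils down to two rounded divisions once the nanoseconds-per-yield figure is measured. Pulled out as a pure function for clarity (a sketch using the constants from the diff; the struct and function names are illustrative):

```cpp
// Constants copied from the diff; names of the struct/function are illustrative.
const int kMaxYieldsPerNormalizedYield = 10;
const int kMinNsPerNormalizedYield = 37;
const int kNsPerOptimalMaxSpinIterationDuration = 272;  // approx. 900 cycles

struct Tuning
{
    int yieldsPerNormalizedYield;
    int optimalMaxNormalizedYields;
};

Tuning Normalize(double nsPerYield)
{
    if (nsPerYield < 1)
        nsPerYield = 1;

    // How many raw YieldProcessor calls make up one normalized yield.
    int ypny = (int)(kMinNsPerNormalizedYield / nsPerYield + 0.5);
    if (ypny < 1)
        ypny = 1;
    else if (ypny > kMaxYieldsPerNormalizedYield)
        ypny = kMaxYieldsPerNormalizedYield;

    // How many normalized yields a late spin iteration should max out at,
    // so a single iteration stays near the ~900-cycle target duration.
    int optimal = (int)(kNsPerOptimalMaxSpinIterationDuration / (ypny * nsPerYield) + 0.5);
    if (optimal < 1)
        optimal = 1;

    return {ypny, optimal};
}
```

A cheap yield (pre-Skylake-like) produces many yields per normalized yield; an expensive one (post-Skylake-like) produces one, with a correspondingly small per-iteration cap.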
diff --git a/src/vm/threads.h b/src/vm/threads.h
index ad433e765b..be36fe624e 100644
--- a/src/vm/threads.h
+++ b/src/vm/threads.h
@@ -5362,6 +5362,70 @@ public:
m_HijackReturnKind = returnKind;
}
#endif // FEATURE_HIJACK
+
+private:
+ static int s_yieldsPerNormalizedYield;
+ static int s_optimalMaxNormalizedYieldsPerSpinIteration;
+
+private:
+ static void InitializeYieldProcessorNormalized();
+
+public:
+ static bool IsYieldProcessorNormalizedInitialized()
+ {
+ LIMITED_METHOD_CONTRACT;
+ return s_yieldsPerNormalizedYield != 0 && s_optimalMaxNormalizedYieldsPerSpinIteration != 0;
+ }
+
+public:
+ static void EnsureYieldProcessorNormalizedInitialized()
+ {
+ LIMITED_METHOD_CONTRACT;
+
+ if (!IsYieldProcessorNormalizedInitialized())
+ {
+ InitializeYieldProcessorNormalized();
+ }
+ }
+
+public:
+ static int GetOptimalMaxNormalizedYieldsPerSpinIteration()
+ {
+ WRAPPER_NO_CONTRACT;
+ _ASSERTE(IsYieldProcessorNormalizedInitialized());
+
+ return s_optimalMaxNormalizedYieldsPerSpinIteration;
+ }
+
+public:
+ static void YieldProcessorNormalized()
+ {
+ WRAPPER_NO_CONTRACT;
+ _ASSERTE(IsYieldProcessorNormalizedInitialized());
+
+ int n = s_yieldsPerNormalizedYield;
+ while (--n >= 0)
+ {
+ YieldProcessor();
+ }
+ }
+
+ static void YieldProcessorNormalizedWithBackOff(unsigned int spinIteration)
+ {
+ WRAPPER_NO_CONTRACT;
+ _ASSERTE(IsYieldProcessorNormalizedInitialized());
+
+ int n = s_optimalMaxNormalizedYieldsPerSpinIteration;
+ if (spinIteration <= 30 && (1 << spinIteration) < n)
+ {
+ n = 1 << spinIteration;
+ }
+ n *= s_yieldsPerNormalizedYield;
+ while (--n >= 0)
+ {
+ YieldProcessor();
+ }
+ }
};
// End of class Thread