Recently we had several IIS application hang crashes caused by thread locks. The threads were locked by calls to Sytem.Diagnostic.PerformanceCounter.NextValue. What we found was that under certain circumstances calls into this routine can block a thread for up to 21 minutes. The circumstances all deal with failures to read the registry due to contention or IO failures.
The root cause of our problem (why are certain performance registry keys unavailable for read/update) is still unknown. So as a temporary measure we removed calls to this method from our IIS application.
Analyzing Thread Blocks
We were really fortunate to have people with the technical skills in house that were comfortable using tools like IISDiag and WinDbg. This saved us a call to Microsoft Technical support.
We were able to capture some process dumps using IISDiag. We got lucky here because we were doing some work on a server when the problem occurred. This allowed us to manually dump the W3WP processes.
Once we had the processes dumped we could use WinDbg to analyze the threads. We found the following pattern with a number of our threads as highlighted by the stack dump below.
To the untrained eye this looks like a bunch of garbage (at least it did to me until I took the time to learn about WinDbg and stack dumps). What it is saying is that the thread has called the Win32 function Sleep from a try/catch exception handler in the System.Diagnostics.PerformanceMonitor.GetData routine. The exception originally occurred in the Microsoft.Win32.RegistryKey.InternalGetValue routine.
Putting the Thread to Sleep
So now we could see that System.Diagnostics.PerformanceMonitor.GetData was putting the thread to sleep. So with a little help from Red-Gate Reflector (this is one of those tools you must have in your bag as a .Net developer) I could see what is happening in that routine.
Below is a snipit from that routine as decompiled by Red-Gate Reflector.
catch (IOException exception)
error = Marshal.GetHRForException(exception);
if (millisecondsTimeout == 0)
millisecondsTimeout = 10;
millisecondsTimeout *= 2;
So if one of the following Win32 Errors is the root cause for the IOException then the thread will go to sleep for some number of milliseconds.
|6||The handle is invalid.|
|1726||The remote procedure call failed.|
|1722||The RPC server is unavailable.|
|21||The device is not ready.|
|167||Unable to lock a region of a file.|
|170||The requested resource is in use.|
|258||The wait operation timed out.|
The thread is being put to sleep because the operation will be tried again (and again and again). That is because the routine has a “while” loop constructed.
int num = 0x11;
int millisecondsTimeout = 0;
int error = 0;
while (num > 0)
So what we discovered is this routine will try up to 17 times before it will finally just give up and throw an exception back up the call stack.
Well 17 tries is not so bad, but what is bad is the bit of code that doubles the sleep timeout every time the while loop is executed. The first puts the thread to sleep for 2 milliseconds, not bad. But if you execute all 17 iterations the thread will block for a little over 21 minutes. Very bad.
Ultimately we will find some sort of issue with our server that results in the registry keys or RPC calls to fail… so it will be on us. But whoever in the .Net Framework BCL team coded this routine to loop for 17 times and wait up to 21 minutes needs to be publicly flogged with hurtful words (which is what I am attempting to do in a pathetic way with this post). Hopefully this gets corrected in future versions.