My client had an issue where their web application (deployed across multiple servers) was randomly making the servers unresponsive with 100% CPU usage.

The first action we took was to configure IIS to automatically kill the worker processes of Application Pools that use too much CPU for more than a few minutes. In the example below, an AppPool is killed when it uses more than 80% CPU over a 3-minute interval.

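Import-Module WebAdministration   # provides the IIS:\ drive

dir IIS:\AppPools | ForEach-Object {
	Write-Host "Updating $($_.Name) ..."

	$appPoolName = $_.Name
	$appPool = Get-Item "IIS:\AppPools\$appPoolName"
	$appPool.cpu.limit = 80000               # in 1/1000ths of a percent: 80000 = 80%
	$appPool.cpu.action = "KillW3wp"         # kill the worker process when the limit is exceeded
	$appPool.cpu.resetInterval = "00:03:00"  # CPU usage is measured over 3-minute intervals
	$appPool | Set-Item
}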
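
To double-check the values, here is a minimal sketch of two helpers. Both function names (ConvertTo-IisCpuLimit and Get-AppPoolCpuSettings) are made up for illustration. The first only does the unit arithmetic (the IIS limit is expressed in 1/1000ths of a percent); the second assumes IIS and the WebAdministration module are installed, so it is defined here but not called.

```powershell
# 'limit' is expressed in 1/1000ths of a percent, so 80% becomes 80000.
function ConvertTo-IisCpuLimit {
    param([double]$Percent)
    return [int]($Percent * 1000)
}

# Hypothetical helper: lists the CPU settings of every Application Pool.
# Needs IIS and the WebAdministration module, so it is only defined here.
function Get-AppPoolCpuSettings {
    Import-Module WebAdministration
    Get-ChildItem IIS:\AppPools | ForEach-Object {
        [pscustomobject]@{
            Name          = $_.Name
            Limit         = $_.cpu.limit
            Action        = $_.cpu.action
            ResetInterval = $_.cpu.resetInterval
        }
    }
}

ConvertTo-IisCpuLimit -Percent 80   # 80000
```

On a server with IIS, running Get-AppPoolCpuSettings after the update should list the new limit, action and reset interval for each pool.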

That solved the immediate problem: the servers stopped becoming unresponsive. But we still had to investigate what was eating all the CPU.

See below how I proceeded with the troubleshooting:

1. Create a Memory Dump

In Task Manager, right-click the IIS Worker Process (w3wp.exe) and choose “Create dump file”.

2. Install Debug Diagnostic Tool

Download and install the Debug Diagnostic Tool (DebugDiag).

3. Run Crash & Hang Analysis for ASP.NET / IIS

Add your dump (DMP) file, select “CrashHangAnalysis”, and click “Start Analysis”.

4. Review Analysis for Problems

The first page immediately suggests that there’s a Generic Dictionary which is being used by multiple threads and is blocking one thread.

A few pages later we can find the threads which are consuming most of the CPU:

If we check those top threads we can see that both are blocked in the same call which is invoking GetVersion() on an API client-wrapper. One thread is trying to Insert on the dictionary (cache the API version), while the other is trying to Find (FindEntry) on the dictionary.

5. What was the issue?

Long Explanation:
Dictionary<TKey, TValue> is a hash map implementation, and like most hash maps it internally keeps a linked chain of entries per bucket (to store multiple elements when different keys land in the same bucket position after being hashed and after taking the hash modulo the number of buckets). The problem is that since Dictionary<TKey, TValue> is not thread-safe, multiple threads trying to change the dictionary may put it into an invalid state (race condition).

Probably different threads were trying to add the same element to the dictionary at the same time (invoking the Insert method, which may internally invoke the Resize method, which rebuilds the chains), leaving a chain (and therefore the whole hash map) in an inconsistent state. Since both Insert() and FindEntry() iterate through the chain, an inconsistent chain (for example, one that loops back on itself) can trap those threads in an infinite loop.
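
To make the infinite loop more concrete, here is a simplified model of a lookup over one bucket. It is not the actual Dictionary source code: the entry layout (a Key plus the index of the Next entry), the Find-Entry function and the MaxSteps guard are all assumptions made for illustration. The guard exists only so the demo terminates; the real lookup has no such guard, which is why the thread spins at 100% CPU.

```powershell
# Simplified model: each entry holds a key and the index of the next entry
# in the same bucket (-1 marks the end of the chain).
$healthy = @(
    [pscustomobject]@{ Key = 'a'; Next = 1 },
    [pscustomobject]@{ Key = 'b'; Next = -1 }
)

# Same entries after a hypothetical race: entry 1 now points back to entry 0.
$corrupted = @(
    [pscustomobject]@{ Key = 'a'; Next = 1 },
    [pscustomobject]@{ Key = 'b'; Next = 0 }
)

function Find-Entry {
    param($Entries, [int]$Start, [string]$Key, [int]$MaxSteps = 10)
    $i = $Start
    $steps = 0
    while ($i -ge 0) {
        if ($Entries[$i].Key -eq $Key) { return $i }
        $i = $Entries[$i].Next
        $steps++
        # Demo-only guard: without it this loop would never end on a cyclic chain.
        if ($steps -ge $MaxSteps) { return 'cycle detected' }
    }
    return -1   # reached the end of the chain: key not found
}

Find-Entry -Entries $healthy   -Start 0 -Key 'missing'   # -1
Find-Entry -Entries $corrupted -Start 0 -Key 'missing'   # cycle detected
```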

Short Explanation:
Since Dictionary<TKey, TValue> is not thread-safe, multiple threads trying to change the dictionary may put it into an invalid state (race condition). So if you want to share a dictionary across multiple threads you should use a ConcurrentDictionary<TKey, TValue>, which as the name implies is a thread-safe class.
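
As a rough illustration of the ConcurrentDictionary API, here is a short sketch that uses the class from PowerShell (the web application itself would use it from its own .NET code). The cache name, the URL key and the version strings are made-up values, and the snippet runs on a single thread, so it only shows the methods, not the concurrency.

```powershell
# Hypothetical cache of API versions, keyed by API base URL.
$versionCache = [System.Collections.Concurrent.ConcurrentDictionary[string,string]]::new()

# TryAdd returns $true when the key was added and $false when it already exists.
$added = $versionCache.TryAdd('https://api.example.test', '1.0')
$again = $versionCache.TryAdd('https://api.example.test', '2.0')

# GetOrAdd returns the cached value if present; otherwise it stores and returns the new one.
$version = $versionCache.GetOrAdd('https://api.example.test', '3.0')

"$added $again $version"   # True False 1.0
```

GetOrAdd is the natural fit for a "look up, and cache if missing" pattern like the GetVersion() call above, because the lookup and the insert go through one thread-safe call.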

It’s a known issue that concurrent access to Dictionary can cause an infinite loop and high CPU in IIS: Link 1, Link 2, Link 3, Link 4.

6. Advanced Troubleshooting using WinDbg

If the Debug Diagnostic Tool had not given us any obvious clue about the root cause, we could have used WinDbg to inspect the memory dump (it also supports .NET/CLR). See example here.

A few days ago I installed some malware by mistake, and decided it was time for reinstalling Windows. I’m a huge fan of fresh installs, and do it at least once or twice a year.

Since I use SugarSync online backup for all my important files, I like to keep them handy, all inside a single folder, outside the Windows/Users folders (actually they are on a different partition, which lets me format the operating system partitions without fear). Since I have all important files outside of their default locations, I always have to reconfigure the location for Documents, Pictures, and other folders. Also, since natively you can only change the location of some very specific folders, my fresh installs usually contain some junctions. And since in a few cases I also need to redirect individual files, I also use hard links, which help me keep safe some individual files like the FileZilla bookmarks and the hosts file.
(Click here to learn more about junctions, hardlinks, and mount points).
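
For reference, here is a minimal sketch of creating a junction and a hard link with PowerShell's New-Item (Windows PowerShell 5 or later, on Windows). All folder and file names are made up, and everything is created under a throwaway folder in the temp directory so nothing real is touched. One constraint worth keeping in mind: a hard link must live on the same volume as the file it points to, while a junction can point to a folder on another local volume.

```powershell
# Everything is created under a throwaway folder in the temp directory.
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("links-demo-" + [guid]::NewGuid())

# The "real" data folder, standing in for a folder on the backup partition.
$dataDir = New-Item -ItemType Directory -Path (Join-Path $root 'Data\Documents') -Force

# Junction: a folder that transparently redirects to another folder.
$junction = New-Item -ItemType Junction -Path (Join-Path $root 'Documents') -Target $dataDir.FullName

# Hard link: a second name for the same file (same volume only).
$file = New-Item -ItemType File -Path (Join-Path $root 'Data\hosts.txt') -Value '127.0.0.1 localhost'
$hardLink = New-Item -ItemType HardLink -Path (Join-Path $root 'hosts.txt') -Target $file.FullName

# Show what was created.
Get-Item $junction.FullName, $hardLink.FullName | Select-Object FullName, LinkType
```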

With that design, I never have to worry about formatting the OS partition, and don’t have to worry about backing up multiple subfolders. It’s all backed up daily in real-time.

To keep that scenario I also have detailed step-by-step instructions on how to configure Windows the way I like it, which include:

  • Dual boot on SSD. One Windows for serious work, and the other for testing different software and installing things that I don’t use very frequently.
  • Ubuntu on another HD partition for open-source work.
  • Small TEMP partition on the start of a HD, for faster writes
  • Paging file on fast TEMP partition
  • TEMP/TMP environment variables pointing to TEMP partition

The instructions include some tips and reminders like:

  • Using YUMI to create bootable USB
  • Which Windows Updates should be skipped and which should be installed
  • Drivers downloaded for offline usage (especially the network adapter, in case Windows does not install it automatically)
  • How to configure SSD for optimal performance
  • Reconfigure Power Options and Regional Settings
  • Configure /etc/hosts
  • Which programs to reinstall, in which order (e.g. SQL Server before Visual Studio), and some configurations.

During this last Windows Reinstall, I faced some problems with OneNote, which was not pasting images correctly. Calling onenote /safeboot and clearing cache and settings (which is a default solution for many problems) didn’t help at all. So I had to do the troubleshooting myself.

I fired up Process Monitor, which traces all read/write attempts both to filesystem and registry, and tried to paste into OneNote again. Then I reviewed the log, filtering only for Onenote.exe, and filtering out all “Success”. I was pretty sure that it was a problem in filesystem, and not in registry (because all my changes were related to filesystem, partitions, and folders), so I also filtered out everything related to registry. This is what was left:

I noticed that OneNote was trying to open C:\Windows\TEMP\Drizin, and it was being reparsed (by a junction) to T:\Drizin. My suspicion was that the reparse point (junction) was the problem (and not permissions). So I changed my environment variables (TEMP and TMP) to point directly to T:\%USERNAME%, instead of pointing to C:\Windows\TEMP\%USERNAME%, which was being reparsed.
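
A quick way to confirm whether a given folder is a reparse point is to check its attributes. Here is a small sketch; the function name Test-ReparsePoint is made up, and the demo folder is a fresh one created in the temp directory.

```powershell
# Hypothetical helper: returns $true when the path is a reparse point
# (junction, symbolic link, mount point, ...).
function Test-ReparsePoint {
    param([Parameter(Mandatory)][string]$Path)
    $item = Get-Item -LiteralPath $Path -Force
    return $item.Attributes.HasFlag([System.IO.FileAttributes]::ReparsePoint)
}

# An ordinary, freshly created folder is not a reparse point.
$plainDir = New-Item -ItemType Directory -Path (Join-Path ([System.IO.Path]::GetTempPath()) ("plain-" + [guid]::NewGuid()))
Test-ReparsePoint -Path $plainDir.FullName   # False
```

Pointing it at a junctioned folder such as the C:\Windows\TEMP\Drizin path from the log above should return True instead.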

Replacing environment variables for current user and for all new users:

reg add "HKEY_CURRENT_USER\Environment" /v TEMP /t "REG_EXPAND_SZ" /d T:\%USERNAME% /f 
reg add "HKEY_CURRENT_USER\Environment" /v TMP /t "REG_EXPAND_SZ" /d T:\%USERNAME% /f 

reg add "HKEY_USERS\.DEFAULT\Environment" /v TEMP /t "REG_EXPAND_SZ" /d T:\^%USERNAME^% /f 
reg add "HKEY_USERS\.DEFAULT\Environment" /v TMP /t "REG_EXPAND_SZ" /d T:\^%USERNAME^% /f 

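PS: Please note that the caret (circumflex) before each percent sign (^%USERNAME^%) is necessary for HKEY_USERS when typing the command at the prompt, or else the command would automatically expand the %USERNAME% variable and then all users would have environment variables pointing to T:\Drizin. :-) (Inside a batch file, doubling the percent signs, %%USERNAME%%, serves the same purpose.)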
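
To check what actually got stored, the value has to be read without expansion. Here is a sketch of one way to do that; the helper name Get-RawEnvironmentValue is made up, it only works on Windows, and it is defined here without being called. The last lines just show what expansion does to such a value for the current user.

```powershell
# Hypothetical helper (Windows only): returns the value exactly as stored,
# without expanding %VARIABLES%.
function Get-RawEnvironmentValue {
    param(
        [string]$KeyPath = 'Registry::HKEY_CURRENT_USER\Environment',
        [Parameter(Mandatory)][string]$Name
    )
    $key = Get-Item -Path $KeyPath
    $options = [Microsoft.Win32.RegistryValueOptions]::DoNotExpandEnvironmentNames
    return $key.GetValue($Name, $null, $options)
}

# What happens to a REG_EXPAND_SZ value when it is read with expansion:
$raw = 'T:\%USERNAME%'
$expanded = [Environment]::ExpandEnvironmentVariables($raw)
"$raw -> $expanded"
```

If the caret escaping worked, calling Get-RawEnvironmentValue -KeyPath 'Registry::HKEY_USERS\.DEFAULT\Environment' -Name TEMP should return the literal T:\%USERNAME% rather than T:\Drizin.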

After restarting OneNote, everything started working again.