Oh. Lovely…
Well, this SQL Server has dumps. At least it stays regular. These happen to all be of the “Non-Yielding Scheduler” variety. Now for the fun task of dump diving.
The Journey Begins
Windbg makes starting relatively easy…load the dump:
Click the blue text:
And wait a long time for windbg to load symbol files. (You don’t actually have to wait if you’re just reading a blog. Kinda nice, huh?) When it finally finishes, behold, the offending stack of a dump:
Now, I may not be an expert, but sqlmin!Spinlock sounds like…a spinlock. This thread has been spinning for over a minute, never returning to a waiting state, because something else is holding the spinlock resource.
Thankfully, helpful friends alerted me to a blog that revealed the value of an acquired spinlock “is the Windows thread ID of the owner.” Meaning I might be able to find the cause.
A Different Approach
I already had a suspect thread – it’s the only other one active when I look at all of SQL Server’s threads.
Actually digging through hundreds of threads is annoying of course, and there’s a slightly faster way: !uniqstack, which removes duplicates. (Thanks to Bob Ward for sharing this trick in one of his presentations.)
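If you're wondering what "removes duplicates" buys you, here is a rough Python sketch of the idea: group threads whose call stacks are identical and show each distinct stack once. This is only an illustration, not how the debugger implements !uniqstack. The function name, the sample thread numbers other than 645, and the shortened frame strings are all made up for the example.

```python
from collections import defaultdict

def unique_stacks(stacks):
    """Group thread numbers by identical call stacks.

    `stacks` maps a thread number to its list of frame names (top frame first).
    Returns one (frames, thread_numbers) pair per distinct stack, in first-seen order.
    """
    groups = defaultdict(list)
    for thread, frames in stacks.items():
        groups[tuple(frames)].append(thread)
    return [(list(frames), threads) for frames, threads in groups.items()]

# Hypothetical, heavily shortened stacks; real ones have many more frames.
sample = {
    12: ["NtSignalAndWait (idle worker)", "scheduler frame (placeholder)"],
    13: ["NtSignalAndWait (idle worker)", "scheduler frame (placeholder)"],
    645: ["sqlmin!Spinlock (placeholder)"],
    700: ["redo frame (placeholder)"],
}

for frames, threads in unique_stacks(sample):
    print(threads, "->", frames[0])
```

With hundreds of idle workers collapsing into a single entry, the few threads that are actually doing something stand out right away.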
If you run that, you’ll see a multitude of stacks with something like this at the top.
Those NtSignalAndWait calls are how SQL Server makes a worker wait. Those workers weren’t doing anything at the time the dump was taken.
However, there’s another thread that isn’t waiting:
I bet this one is holding the spinlock, but I have to prove it.
Obstacles
I need to find the memory address of the owner, and confirm which part of this info is the Windows thread. Let’s start by looking at the spinlocked thread.
Thankfully, the context is already 645, but if we needed to change it, we could use ~645s. Note that 645 is NOT the Windows thread id. Instead, we can look at the part that says Id: 1d60.3914. 1d60 is the process (sqlservr.exe) id, and 3914 is the thread id.
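One thing that's easy to trip over: both halves of that Id field are hexadecimal. If you ever want to match them against a tool that reports decimal ids, a tiny helper like the one below does the conversion. The function name is mine, not anything from windbg.

```python
def parse_thread_id(id_field):
    """Split a 'pid.tid' field (both parts hexadecimal) into two integers."""
    pid_hex, tid_hex = id_field.split(".")
    return int(pid_hex, 16), int(tid_hex, 16)

pid, tid = parse_thread_id("1d60.3914")
print(f"process id: 0x{pid:x} = {pid}")   # 0x1d60 = 7520
print(f"thread id:  0x{tid:x} = {tid}")   # 0x3914 = 14612
```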
Let’s pull up the registers saved for the thread.
There are a load of them, but we want rip. The rip register holds the address of the instruction this thread was executing when the dump was taken, which should be somewhere in the spinlock code. Disassembling that code is how we figure out which memory address holds the lock.
Cool. Now we can copy that address into the disassembler.
Found it! The critical part of a spinlock has to be an atomic compare and swap – the lock cmpxchg. In this case it’s checking the value at the memory address in rdi and replacing it with rsi, unless the lock is already held (which it is). Looking at our registers, rsi is indeed the Windows thread id, 3914. Now we just look at the memory address held in rdi.
And there we see the Windows thread id of the spinlock owner, 1280!
We can switch to that thread with ~~[1280]s
…yup, it’s the same one I found before, the one that looks like it’s doing Redo operations. Victory!
Of course, when I bothered to actually read the documentation on cmpxchg, I learned that if a different value is in the target address (i.e., the spinlock is held by another thread), that value is placed in the eax register. Meaning I could have bypassed the whole exercise with r eax.
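For anyone who wants the cmpxchg behavior spelled out, here is a small Python model of it. It is a sketch, not the real thing: a dict stands in for memory, nothing here is atomic, the lock address is invented, and I'm assuming a free spinlock holds 0 (the dump only tells us what a held one looks like).

```python
def cmpxchg(memory, address, accumulator, source):
    """Toy model of x86 CMPXCHG on a single memory cell.

    Returns (swapped, accumulator_after). `memory` is a dict standing in for RAM.
    """
    current = memory[address]
    if current == accumulator:
        memory[address] = source      # values matched: store the new value
        return True, accumulator
    return False, current             # mismatch: accumulator receives the current value

FREE = 0             # assumption: 0 means "nobody holds the lock"
LOCK_ADDR = 0x1000   # hypothetical address, playing the role of rdi

# The lock is already held by thread 1280; thread 3914 tries to take it.
memory = {LOCK_ADDR: 0x1280}
acquired, eax = cmpxchg(memory, LOCK_ADDR, FREE, 0x3914)
print(acquired, hex(eax))   # the failed attempt leaves the owner's id in eax

# Same attempt against a free lock: the swap succeeds and 3914 becomes the owner.
free_memory = {LOCK_ADDR: FREE}
acquired_free, _ = cmpxchg(free_memory, LOCK_ADDR, FREE, 0x3914)
```

The first call is the situation in the dump: the compare fails, memory is untouched, and the owner's thread id comes back in the accumulator.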
Concussion
Nope, not a typo. The appropriate ending to a windbg post is indeed head trauma. Hope you enjoyed it.