Grokking the Paul White Parallel Deadlock Demo

I like the word grok. I like it because people who remember its heyday (decades and decades ago) give me weird looks when I use it. I like it because it helps me find other Heinlein fans. Moreover, I like it because its assigned definition resonates with me. I don’t want to merely learn something; I crave a deeper understanding that internalizes lessons and shapes my intuition into expertise, perhaps even mastery. And that is grokking.

Parallel deadlocks are something I first learned about years ago, ehhh, let me change that to “kinda learned.” These suckers are weird. I wasn’t comfortable with them, but after diving into the under-the-hood mechanics of serial execution plans, I was finally ready to revisit how the strange parallel stuff actually works.

Now, the SQL Sensei of parallelism is Paul White, and you should read him if you want to learn X, for X in (parallelism, optimization, RCSI locking, and sundry related topics). In particular, he has an MCVE (a minimal, complete, verifiable example) of a parallel deadlock at 53:57 of this video. I watched that demo a while ago, but it wasn’t until recently that I grokked it. And that’s what I want to write about.

There are several pieces that have to come together to produce a parallel deadlock. Simplified, a deadlock is when two processes have a dependency on each other, each one saying “I’ll move after you do.”

The central problem is that it’s not obvious how to introduce inter-thread dependencies with parallelism. Paul’s demo relies on two key operators with dependencies. The first is Round Robin Distribute Streams, animated below (and with a prior blog post here).

Round Robin distributes rows in turn, so if one thread isn’t accepting any data, it’s possible to force all the others to wait.
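Here’s a toy, single-threaded Python model of that behavior (the buffer size and names are my invention, not SQL Server internals). Consumer 1 drains every row immediately, consumer 0 never drains at all, and distribution still grinds to a halt once the stalled consumer’s small buffer fills:

```python
import itertools

# Round Robin offers rows to its consumers strictly in turn. Because it
# cannot skip a consumer, one stalled consumer with a full buffer stops
# distribution for everyone, even consumers with room to spare.
BUFFER_SIZE = 2
stalled_buffer = []            # consumer 0 never removes anything from this

def distribute(rows):
    taken = 0
    for row, target in zip(rows, itertools.cycle([0, 1])):
        if target == 0:
            if len(stalled_buffer) >= BUFFER_SIZE:
                return taken   # the real exchange would block here, forever
            stalled_buffer.append(row)
        # target 1 consumes its row immediately, so it never blocks
        taken += 1
    return taken

result = distribute(range(100))
print(result)                  # 4: distribution stalls almost immediately
```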

The second key operator is Gather Streams, specifically order-preserving Gather Streams. Regular Gather Streams can pass along rows in any order, as below.

Ordered Gather Streams, by nature of its ordering, needs to grab the next value. And when one of the inputs is empty, there’s no way to know what should be “next,” and it is forced to wait.
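In toy Python form (my own simplified model, not the engine’s actual algorithm), an order-preserving merge can only emit a row when every still-open input has a candidate, because the next row in order might come from any of them:

```python
# Each input is a per-thread buffer of pending rows, kept in descending
# order to match the demo's backwards scan; None marks a finished stream.
def merge_step(inputs):
    open_streams = [buf for buf in inputs if buf is not None]
    if any(len(buf) == 0 for buf in open_streams):
        return None            # must wait: an open input has no candidate
    # safe to emit the largest head value across the open inputs
    best = max(open_streams, key=lambda buf: buf[0])
    return best.pop(0)

print(merge_step([[9, 7], [8, 6]]))   # 9: every input has a candidate
print(merge_step([[7], []]))          # None: forced to wait on the empty input
print(merge_step([[7], None]))        # 7: a finished stream is fine to skip
```

Note the difference between an *empty* input (might still produce rows, so wait) and a *finished* input (safe to ignore); the deadlock lives in that first case.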

We finally have enough to create an inter-thread dependency. We just need to gum up a plan with Round Robin and Ordered Gather Streams where each is waiting on the other to proceed. Simplified, we want something like below.

But SQL Server doesn’t choose parallelism without reason – there has to be some work it can distribute, even when using secret-sauce trace flag 8649. In this case, a Filter operator will suffice. Moreover, we can use it to filter out all the rows on one of the threads, forcing the Gather Streams into a waiting state. We now have a list of requirements:

  1. Round Robin
  2. Ordered Gather Streams
  3. Filter that removes all rows on one thread

Getting #1, the Round Robin, takes a little effort. In this case, a backwards scan is the preferred technique, especially since the ordering is useful for getting #2, the Ordered Gather Streams. But that’s not the only piece to be aware of with Gather Streams. That operator has buffers to hold multiple rows, which helps prevent deadlocks. Getting Gather Streams stuck requires filling up those buffers, using rows of sufficient size or quantity.

Finally, the filtering for #3 looks easy enough with the modulus (%) operator, but there’s another gotcha: SQL Server likes to push predicates down into the Scan, bypassing the use of a Filter operator at all. Thankfully, as your developers can attest, this performance optimization is easy to defeat. Just route the predicate through a MAX-length type – a predicate built on a varchar(max) conversion can’t be pushed down into the scan.

So here we are, after assembling all the pieces. This is Paul’s demo code:

SELECT MaxN = MAX(num.n)
FROM dbo.Numbers AS num
WHERE num.n < 30000
    AND CONVERT(integer, CONVERT(varchar(max), num.n)) % 2 = 0
OPTION (
    MAXDOP 2,
    QUERYTRACEON 8649
);

The MAX supplies the ordering requirement. The < 30000 predicate encourages the backwards scan. The double CONVERT through varchar(max) prevents predicate pushdown, keeping the Filter in the plan. MAXDOP 2 caps the plan at two threads, and trace flag 8649 forces parallelism.

Visualized, the deadlock might look like this:

Eventually, the deadlock detector discovers the problem and resolves it by writing rows to tempdb, in a process I don’t understand yet. Backblog++
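For what it’s worth, here’s my mental model of that cycle as a tiny wait-for graph in Python – the kind of loop a deadlock detector hunts for. The node names and edges are my own reading of the demo, not engine internals:

```python
# Each key waits on its value. Following the chain comes back around.
waits_on = {
    "Gather Streams": "thread A",  # waiting for a row the Filter ate
    "thread A": "Round Robin",     # waiting to be handed more rows
    "Round Robin": "thread B",     # waiting on thread B's full buffer
    "thread B": "Gather Streams",  # waiting for its output to be drained
}

def find_cycle(graph, start):
    seen, node = [], start
    while node not in seen:
        seen.append(node)
        node = graph.get(node)
        if node is None:
            return None            # the chain ends: no deadlock from here
    return seen[seen.index(node):] + [node]

cycle = find_cycle(waits_on, "Gather Streams")
print(" -> ".join(cycle))
```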

Knowing all the ingredients of this parallel deadlock can definitely help with figuring out and fixing actual production parallel deadlocks, but if I’m honest, the genuine reason I studied this stuff is because I’m a nerd and I think it’s really stinkin’ cool. Whatever your reasons, I hope this was helpful to your own understanding. Happy grokking!