MAXDOP is a Lie

Let’s start with STATISTICS TIME output from a query:

How many cores is it using? 124ms of CPU over 42ms of elapsed time = an average of 2.95 cores in use.
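(For anyone following along at home, that output comes from the timing statistics. A minimal way to capture the same kind of output, with a placeholder query standing in for whatever you actually want to measure:)

SET STATISTICS TIME ON;

--any query you're curious about goes here; this one is just a stand-in
SELECT COUNT(*) FROM dbo.Posts;

SET STATISTICS TIME OFF;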

Now what if I told you this was a MAXDOP 2 query?

It’s true, this was a MAXDOP 2 query, and it was definitely using more than 2 cores at once (though admittedly STATISTICS TIME can be imprecise). A friend of mine noticed this situation happening, and Paul White provided the explanation: it’s only the parallel branches that are limited to MAXDOP. The coordinator worker (the serial section at the far left of the plan) runs at the same time. Usually it doesn’t have much to do, and the parallel zones don’t run full speed, so you don’t see more than MAXDOP CPU usage.

But now that we know this is possible, how would you design a query to deliberately use more than MAXDOP? My own solution was to find a plan with minimal blocking, and force computation on the serial thread with TOP and some column weirdness.

And here’s the monstrosity I used (StackOverflow2010):

;WITH CTE AS (
SELECT TOP 100000 'a' Val
FROM dbo.Posts p1
JOIN dbo.Posts p2
	ON p1.Id = p2.Id
	)

SELECT 
COUNT(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(
CASE WHEN CONVERT(nchar(1),Val) = N'a' THEN 1 END
)+1)+1)+1)+1)+1)+1)+1)+1)+1+1.1))
FROM CTE
OPTION(
QUERYTRACEON 8649, MERGE JOIN,
MAXDOP 2)

It’s…not elegant, so I’m curious what examples others can find. But hey, this is a query that uses more than MAXDOP!

Edit: it seems that I accidentally trolled Pedro Lopes with the title. No, MAXDOP is not a lie, technically it’s in the documentation (and here, links courtesy of Pedro). Sorry Pedro! However, if you really want to learn how things work, I recommend reading Paul White’s article instead, which happens to be *both* accurate and readable.

A Poem for Flow Distinct

In SQL Server’s algorithmic land
Are many operators elegant
But Flow Distinct is rare to understand
The hash match hides a subtle variant

Unlike the standard aggregates you know
It has no need of input ordering
Perfected by the fact it streams its rows
It stands atop the normal Hash or Stream

It hashes rows it pulls from data source
If new the hash is stored along with key
(Discarding duplicated rows of course)
Accumulating hashes gradually

To spot this operator you can try
A top distinct without an order by
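If you want to go hunt for it yourself, here's a minimal sketch, using the StackOverflow2010 database referenced elsewhere in this post (any reasonably large table without a helpful index on the grouped column should do):

--DISTINCT plus TOP with no ORDER BY is the classic recipe for Flow Distinct
SELECT DISTINCT TOP (50)
p.OwnerUserId
FROM dbo.Posts AS p

You should usually see a Hash Match (Flow Distinct) operator in the plan, though the optimizer is free to choose differently depending on available indexes.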

Sparse Stats Surprise

For some reason there are a number of incredible SQL guys named Bob at Microsoft. My personal favorite is boB (yes, boB, like Bob, just spelled backwards). I had the privilege of meeting him, and besides receiving quality SQL advice from him, I got to see boB do some great magic tricks.

I couldn’t help but think of this SQL magician when I ran across this puzzler. The Cardinality Estimator was playing tricks on me, and I imagined boB with the ol’ ball-and-cups game standing behind the screen.

Let’s start with a 2M row table where IsRow is almost always 1, and TypeID is almost always zero. (Full repro script at the bottom of this post.)

And here’s an exact count of values.

Let’s check the cardinality estimate for IsRow = 1

SELECT COUNT(*)
FROM CEPuzzler
WHERE IsRow = 1

That’s a pretty good estimate. Let’s try for TypeID = 1

SELECT COUNT(*)
FROM CEPuzzler
WHERE TypeID = 1

Not perfect, but it’s still a small fraction of the size of the table, so I’d call it a good estimate.

What happens when we combine the two prior filters?

SELECT COUNT(*)
FROM CEPuzzler
WHERE IsRow = 1
AND TypeID = 1
What the heck?!

boB would execute a suave magician’s pose here, so, uh, imagine me flourishing a cape or something. Really though, I just let out a giant, “What the heck?!” when I found this. Somehow making the WHERE clause more restrictive made the row estimate increase.

Unlike boB, I’m going to tell you how the trick works. The short version (the one not involving trace flag 2363) boils down to density vectors and out-of-range stats.

There are so few TypeID = 1 values that they don’t get picked up in the stats sampling. Thus 1 ends up with the same estimate as other non-existent values, sqrt(2000000) = 1414.

SELECT COUNT(*)
FROM CEPuzzler
WHERE TypeID = -99999

I have a couple indexes on the table, and one of them uses both TypeID and IsRow. When the current CE encounters an out-of-range value, it defaults to using the density vector, which leads to the ridiculous estimate of half the table.

DBCC SHOW_STATISTICS('dbo.CEPuzzler',IX_CEPuzzler_TypeID_IsRow)

There are a couple of potential fixes here (and 4199 is not one of them). Perhaps the most interesting is to demote IsRow in the previous index from a key column to a mere included column. That keeps it from showing up in the density stats, and leads to better estimates.
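That rework is a quick drop-and-recreate against the repro script’s index (a sketch; ALTER INDEX can’t change key columns, so drop and recreate, or use DROP_EXISTING):

DROP INDEX IX_CEPuzzler_TypeID_IsRow ON dbo.CEPuzzler
--IsRow moves from the key into an INCLUDE, so it no longer appears in the density vector
CREATE INDEX IX_CEPuzzler_TypeID_IsRow ON dbo.CEPuzzler(TypeID) INCLUDE (IsRow)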

But there’s a better solution:

SELECT COUNT(*)
FROM CEPuzzler
WHERE IsRow = 1
AND TypeID = 1
OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'))
#TeamLegacy

Abracadabra!

--repro script

CREATE TABLE CEPuzzler (
ID int identity primary key,
IsRow bit not null,
TypeID int not null,
junk char(100)
)

CREATE INDEX IX_CEPuzzler_IsRow ON dbo.CEPuzzler(IsRow)
CREATE INDEX IX_CEPuzzler_TypeID_IsRow ON dbo.CEPuzzler(TypeID,IsRow)

INSERT CEPuzzler
SELECT TOP (2000000)
1-(ROW_NUMBER() OVER(ORDER BY 1/0))%20001/20000, --almost all 1s
(ROW_NUMBER() OVER(ORDER BY 1/0))%400001/400000, --almost all 0s
'junk'
FROM master..spt_values a
CROSS JOIN master..spt_values b
CROSS JOIN master..spt_values c

--your instance may, through differently random sampling,
--pick up some of the rare values, giving you different results
UPDATE STATISTICS CEPuzzler WITH SAMPLE 1 PERCENT

--Here's a query closer to what I encountered in prod,
--showing why these bad stats matter.
--Try it with legacy CE, and see how much faster it is.
--Or just update stats with fullscan...whatever
SELECT junk
FROM CEPuzzler
WHERE IsRow = 1
AND TypeID = 1
--OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'))

The Half Apply Query

I came across a bizarre query plan at work, and ended up getting mad at the optimizer. Of course, when I shared the query with Joe Obbish, he told me to “stop thinking like a human.” Thanks, Joe.

See this estimated plan? Buncha seeks, small arrows? Looks like a safe OLTP plan, yeah?

Well, it’s actually a CPU-eating monster that rampaged through my server, causing timeouts and panicked phone calls. Here’s my setup to repro the issue (minus phone calls):

CREATE TABLE #A (
GroupID int,
PersonID int,
DetailID int,
CONSTRAINT PK_#A_GroupID_PersonID PRIMARY KEY CLUSTERED (GroupID,PersonID)
)

CREATE TABLE #B (
GroupID INT,
PersonID INT,
Junk CHAR(100)
)

CREATE INDEX IX_#B_GroupID_PersonID ON #B(GroupID,PersonID)

INSERT #A
SELECT TOP 100000
(ROW_NUMBER() OVER(ORDER BY (SELECT 1/0)))%3,
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0)),
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0))
FROM master.dbo.spt_values x
CROSS JOIN master.dbo.spt_values y

INSERT #B
SELECT TOP 100000
(ROW_NUMBER() OVER(ORDER BY (SELECT 1/0)))%3,
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0)),
'data'
FROM master.dbo.spt_values x
CROSS JOIN master.dbo.spt_values y

And here’s what the tables look like – three distinct GroupIDs (0, 1, and 2), with everything else unique.

What happened in prod was a plan getting compiled for values outside of the histogram. Though I could simulate that with some more inserts…I’m lazy. OPTIMIZE FOR works just fine, and I include a hint to force the legacy CE, since that’s what I need on my machine. Also, my code plugin is still borked, so I’m pasting as little ugly code as possible.

DECLARE @id1 INT = 2
SELECT *
FROM #A a
LEFT JOIN #B b
ON b.GroupID = a.GroupID AND b.PersonID = a.PersonID
WHERE a.GroupID = @id1 AND a.DetailID < 1000
--Add OPTION section to get bad plan
--OPTION(OPTIMIZE FOR (@ID1 = 3)
--,USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'))

Here’s the in-histogram plan:

And here’s the out-of-histogram plan of fail and awfulness:

Compare it to the estimated plan up top, and you’ll see something has gone terribly wrong. The number of rows read on that bottom seek is really high, and if we look at the operator, we see why.

What the heck? It’s only seeking on GroupID, even though the other key is part of the index! If we look at the leftmost join, we finally see the PersonID comparison. Oops.

So what went wrong? In the good plan, both keys are pushed down into the seek under a nested loop join in an APPLY pattern. (Read more about JOIN vs APPLY Nested Loops from Paul White here.) However, in the bad plan, only one of the keys gets pushed down, while the other stays at the join. Weird. Using the new CE gave me the better plan, but some of my helpful friends testing with the new CE still got the bad plan.

To a human, it should be obvious that pushing both keys down is better. But let’s put our think-like-the-optimizer hat on.

When compiling for an out-of-range value, Dr. Optimizer hypothesizes there won’t be any rows returned from #A. Then when seeking only for the matching GroupID in #B, there won’t be any rows (rounded up to 1 row), which is exactly the same as would happen when seeking on both GroupID and PersonID.

Thus we end up with a situation where both the optimal and awful plans have the same query buck cost, because every operation is costed at one row. (Feel free to test this out by adding a FORCESEEK hint to #B that specifies both key columns). It also explains why only some of my friends were able to repro this. Faced with plans of equal cost, SQL Server then used some factor X to choose one or the other on different machines.
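Here’s a sketch of that costing test, forcing the seek on both key columns so you can compare estimated costs against the bad plan (same temp tables and variable as the repro above; if the optimizer can’t honor the hint you’ll get a plan-generation error instead):

DECLARE @id1 INT = 2
SELECT *
FROM #A a
LEFT JOIN #B b WITH (FORCESEEK (IX_#B_GroupID_PersonID (GroupID, PersonID)))
ON b.GroupID = a.GroupID AND b.PersonID = a.PersonID
WHERE a.GroupID = @id1 AND a.DetailID < 1000
OPTION(OPTIMIZE FOR (@id1 = 3)
,USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'))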

So yeah, the solution to this is to stop thinking like a human, and realize that the optimizer is a terrible Scrooge who cares only about cost. With equivalently costed plans, of course it might pick the bad (according to humans) one. Special thanks to my friends Joe and Paul for helping me reason through this.

P.S. For anyone interested, it appears that rule SelToIdxStrategy creates the incomplete seek, and AppIdxToApp the two-column seek.

Cost of a Query

There’s something about being a DBA that gives us special insight into the dysfunctions of both code and organizations. When we’re the ones keeping databases running, we get a first row seat to all the dumb that floats up.

Yet sometimes it feels hard getting developers to care. Five thousand deadlocks an hour? Don’t worry, we have retry logic. Entity Framework uses the wrong datatype in a query, causing it to scan and consume half a server’s CPU? Oh no worries, everything is still working. Remember…servers are cattle, not pets.

Bitterness aside, I found something that helps communicate impact: cost. No, not query bucks – dollars. One of the wonderful benefits of Azure is that organizations can see the cost of garbage code. Well, those of us working with on-premise servers can use the same approach. Send an email informing developers (and their managers) that one of their bad queries would cost $80k a year to run in Azure, and you’ll get traction. It’s also fun to show your boss how much your indexing was worth, right around review time.

Here’s the core of my approach:

  • Find an approximately equivalent resource in Azure (e.g. if you have an AG, look at a Business Critical Managed Instance).
  • Use your company’s pricing info (or the handy online estimator) to look at cost, and then calculate dollars per core-hour.
  • Add in scaling assumptions. For example, any server consuming more than 75% of CPU in your organization may be considered too small and increased in size. This would make the available compute pool 75% instead of 100% for our calculation.
  • Grab CPU usage metrics from your favorite source (yay Query Store!) for the offending query. Gather them during peak hours if possible, because that’s what you size your physical server around.
  • Scale the cost to per-year (yes, I admit it’s to get a larger number, but it’s a timeframe the business usually budgets around too).

Step 2: Math. Step 3: Profit! I figured you might not want to math it out yourself, so here’s a Google Sheet with my formulas.
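If you’d rather math it out in T-SQL than in a spreadsheet, here’s a minimal sketch of the same formulas – every number below is invented purely for illustration, so plug in your own pricing and Query Store data:

--all inputs are hypothetical placeholders
DECLARE @AzureMonthlyCost float = 10000.0 --comparable Azure resource, dollars per month
DECLARE @Cores int = 16 --cores on that resource
DECLARE @UsableCpuFraction float = 0.75 --scaling assumption: servers sized to 75% CPU
DECLARE @QueryCpuCoreHoursPerDay float = 12.0 --CPU the offending query burns per day, in core-hours

DECLARE @DollarsPerCoreHour float =
@AzureMonthlyCost / (@Cores * 730.0 * @UsableCpuFraction) --roughly 730 hours in a month

SELECT DollarsPerCoreHour = @DollarsPerCoreHour,
AnnualQueryCost = @QueryCpuCoreHoursPerDay * 365 * @DollarsPerCoreHour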

I’ve had a lot of success communicating this way, and I’m especially happy when I have actual Azure resources to work with. There are still some imperfect assumptions in my approach (scale and comparable Azure resources I think are the weakest), so I’d love feedback and suggestions. Now go use math to beat up developers!

Aggregate Operators, Visualized

Sometimes SQL Server has weird behavior that only makes sense when you learn what’s going on (though sometimes it just gets weirder). A recent post by an attentive blogger pointed out a query with GROUP BY returning ordered results, even without an ORDER BY. What’s going on here? The answer has to do with the aggregate operators in SQL Server.

Aggregates are probably one of the first topics you learned in your journey through SQL. They’re used everywhere – not just in relational databases. Open up just about any Excel sheet and you’ll find SUM(), MAX(), COUNT(), or the like. Under the hood, SQL Server uses two primary operators to actually calculate these, the Stream Aggregate and Hash Match Aggregate.

These two algorithms have strong similarities to the Merge Join and Hash Join algorithms (described here, with visualizations, which I recommend reading before continuing).

Let’s begin with an animated Stream Aggregate (using COUNT):

This is very similar to the Merge Join. The key detail is that it requires ordered input. Each successive value within a group gets aggregated, and as soon as the next sequential value shows up, the aggregation starts over.

Hash Match Aggregation works a little differently, as animated below:

Instead of requiring an ordered input, the Hash Aggregate builds a hash table, and stores aggregation details in memory. Of course, that assumes you have enough memory. Say hello to tempdb spilling if you run out. Hash Aggregate also happens to be a blocking operator – it can’t return any rows until it’s consumed all its input.

We end up with the same trade-offs we did with Merge and Hash joins. Stream Aggregate doesn’t block and doesn’t need memory, but needs an ordered input. Hash Aggregate can handle inputs in any order, but needs memory and blocks output until all rows are processed.

You can see both of these aggregates with a single table, using the sample script below. It loads a table with 10,000 rows, where both columns hold the same values, clustered on the first column.

CREATE TABLE #t (ID1 int, ID2 int)
CREATE CLUSTERED INDEX CX_#t ON #t(ID1)

INSERT #t
SELECT TOP (10000)
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0)),
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0))
FROM master..spt_values a
CROSS JOIN master..spt_values b

Run a GROUP BY query on ID1, the clustering column, and you get a Stream Aggregate.

A GROUP BY on ID2, which SQL Server can’t guarantee is ordered, will produce a Hash Aggregate plan.
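For reference, the two queries are as simple as they sound (a sketch against the temp table above):

--clustering key ID1 provides ordered input: expect a Stream Aggregate
SELECT ID1, COUNT(*) AS Cnt FROM #t GROUP BY ID1

--no order guarantee on ID2: expect a Hash Match (Aggregate)
SELECT ID2, COUNT(*) AS Cnt FROM #t GROUP BY ID2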

The results are no longer ordered for the Hash Aggregate plan either.

You can hopefully guess the answer to the ordering puzzle by now. The blogger was looking at a query where an existing index made a Stream Aggregate possible, and got the unexpectedly ordered results because of it. With different queries however, it’s easy to end up with unordered results as SQL Server chooses the Hash Match algorithm instead. No longer weird, but still pretty cool.

Grokking the Paul White Parallel Deadlock Demo

I like the word grok. I like it because people who remember its heyday (decades and decades ago) give me weird looks when I use it. I like it because it helps me find other Heinlein fans. Moreover, I like it because its assigned definition resonates with me. I don’t want to merely learn something; I crave a deeper understanding that internalizes lessons and shapes my intuition into expertise, perhaps even mastery. And that is grokking.

Parallel deadlocks are something I first learned about years ago, ehhh, let me change that to “kinda learned.” These suckers are weird. I wasn’t comfortable with them, but after diving into the under-the-hood mechanics of serial execution plans, I was finally ready to revisit how the strange parallel stuff actually works.

Now, the SQL Sensei of parallelism is Paul White, and you should read him if you want to learn X, for X in (parallelism, optimization, RCSI locking, and sundry related topics). Particularly, he has an MCVE of a parallel deadlock at 53:57 of this video. I watched that demo a while ago, but it wasn’t until recently that I grokked it. And that’s what I want to write about.

There are several pieces that have to come together to produce a parallel deadlock. Simplified, a deadlock is when two processes have a dependency on each other, each one saying “I’ll move after you do.”

The central problem is that it’s not obvious how to introduce inter-thread dependencies with parallelism. Paul’s demo relies on two key operators with dependencies. The first is Round Robin Distribute Streams, animated below (and with a prior blog post here).

Round Robin distributes rows in turn, so if one thread isn’t accepting any data, it’s possible to force all the others to wait.

The second key operator is Gather Streams, specifically order-preserving Gather Streams. Regular Gather Streams can pass along rows in any order, as below.

Ordered Gather Streams, by nature of its ordering, needs to grab the next value. And when one of the inputs is empty, there’s no way to know what should be “next,” and it is forced to wait.

We finally have enough to create an inter-thread dependency. We just need to gum up a plan with Round Robin and Ordered Gather Streams where each is waiting on the other to proceed. Simplified, we want something like below.

But SQL Server doesn’t choose parallelism without reason – there has to be some work it can distribute, even when using secret-sauce trace flag 8649. In this case, a Filter operator will suffice. Moreover, we can use it to filter out any rows on one of the threads, forcing the Gather Streams into a waiting state. We now have a list of requirements:

  1. Round Robin
  2. Ordered Gather Streams
  3. Filter that removes all rows on one thread

Getting #1, the Round Robin, takes a little effort. In this case, a backwards scan is the preferred technique, especially since the ordering is useful for getting #2, the Ordered Gather Streams. But that’s not the only piece to be aware of with Gather Streams. That operator has buffers to hold multiple rows, which helps prevent deadlocks. Getting Gather Streams stuck requires filling up those buffers, using rows of sufficient size or quantity.

Finally, the filtering for #3 looks easy enough with the modulus (%) operator, but there’s another gotcha: SQL Server likes to push predicates into the Scan, bypassing the use of a Filter operator at all. Thankfully, as your developers can attest, this performance optimization is easy to defeat. Just use MAX-length columns, which can’t be pushed down.

So here we are, after assembling all the pieces. This is Paul’s demo code:

SELECT MaxN = MAX(num.n)
FROM dbo.Numbers AS num
WHERE num.n < 30000
AND CONVERT(integer,CONVERT(varchar(max),num.n)) % 2 = 0
OPTION(
MAXDOP 2,
QUERYTRACEON 8649
)

Ordering comes from the MAX. The backwards scan is encouraged by < 30000. The double convert prevents predicate pushdown, and parallelism is from 8649.

Visualized, the deadlock might look like this:

After which the deadlock detector will eventually discover the problem, resolving it by writing rows to tempdb in a process I don’t understand yet. Backblog++

Knowing all the ingredients to this parallel deadlock can definitely help figuring out and fixing actual production parallel deadlocks, but if I’m honest, the genuine reason I studied this stuff is because I’m a nerd and I think it’s really stinkin’ cool. Whatever your reasons, I hope this was helpful to your own understanding. Happy grokking!

The Curious Case of the Insert Deadlock

My favorite deadlock story actually isn’t about SQL Server, but comes from the American Airlines pilot who taught me math. Slightly embellished, it goes like this.

In the distant past some freshly minted MBAs claimed they could save money by implementing a new scheduling system, and for some reason the company’s leadership listened to them. Several months later this pilot lands at his destination, but his gate is occupied by another plane and the control tower tells him to wait – its pilot is running late. Time passes with no progress, and he gets more and more frustrated about the delay. Determined to chew out the tardy pilot, he radios in again to find out the name of the offender holding him up. The pilot they were waiting on…was him!

This was a deadlock. He couldn’t dock his plane until the other one moved, and the other one couldn’t move until he docked and transferred. SQL Server deadlocks follow the same pattern of two (or more) processes needing the other to progress before they can continue, but only sometimes have angry passengers.

A lot of deadlocks are simple, but I found an unusual one recently – two INSERTs deadlocking – that forced me to dig deeper. As with most things horrible in SQL Server, it began with Entity Framework/LINQ, and was compounded by developer ignorance. As much as I hate LINQ for performance, it *is* a perpetual fountain of dumb, i.e., blog post material.

Here’s my recreation of the underlying table:

CREATE TABLE dbo.DLtest1 (ID INT identity, junk char(4)) 
CREATE CLUSTERED INDEX CX_DLtest1_ID ON dbo.DLtest1(ID)

And here’s what the query looked like:

INSERT dbo.DLtest1 (junk) VALUES ('junk') 

SELECT ID FROM dbo.DLtest1 
WHERE ID = SCOPE_IDENTITY()

Guess what isolation level this was running in? That’s right, the good ol’ EF TransactionScope() special, Serializable. Besides that, note the SELECT clause not stopping at SCOPE_IDENTITY, but actually joining to the table on it. Yeah…

Eagle-eyed readers might have noticed another problem: the clustered index wasn’t defined as unique. Making it unique is actually enough to fix the issue, and what I did in production. Functional code is boring though, so let’s keep investigating the deadlock as is.
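For completeness, the boring production fix is basically one keyword (a sketch, assuming no duplicate IDs have snuck in yet):

--recreating the clustered index as unique means equality seeks under Serializable
--can take plain key locks instead of range locks
CREATE UNIQUE CLUSTERED INDEX CX_DLtest1_ID ON dbo.DLtest1(ID) WITH (DROP_EXISTING = ON)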

The first mystery was the sequence of locks. Running an extended event on the batch, I saw a LAST_MODE lock released…but it had never been acquired! Special thanks to Paul White for pointing out that the events for lock_acquired are incorrect. First, LAST_MODE is really RX-X, and second (mind-blowingly), SQL Server may request a certain lock, but the lock granted could be different!

A quick aside on range locks in SQL Server. We mostly see these with Serializable isolation level. They have two components: the range and the key it is attached to. The range is the potential space between the key and next lower key. I find it helps to imagine these on a number line. An RX-S lock on 5 would be an X lock on the range between 4 (assuming that’s the next lower value) and 5, with an S lock on the value of 5 itself.
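If you want to see these range locks for yourself, a quick peek at sys.dm_tran_locks works (a sketch, assuming dbo.DLtest1 from above already has a handful of rows in it):

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRAN
SELECT ID FROM dbo.DLtest1 WHERE ID = 5

--key locks held by this session; expect RangeS-S on the matching key and the next key up
SELECT resource_type, resource_description, request_mode, request_status
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID
AND resource_type = 'KEY'
ROLLBACK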

Pretty pictures!

After Paul’s answer (and some XE wrangling), I was finally able to piece together the sequence of locks in an INSERT + SELECT. The INSERT begins with the RI-N lock (the I-lock preventing S and X on the range section, but with no lock on the key). This transitions to an X lock on the key when it is inserted. With the SELECT section, RS-S locks are taken on the key and the next higher key (in this case, usually infinity). But with the existing X lock on the key, the requested RS-S lock is instead granted as RX-X.

Easy, right? Who am I kidding, this took me days of poking and prodding. Here, have an animation:

Time to tackle the deadlock…here’s what I ran in two different sessions to reproduce it:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
SET NOCOUNT ON

DECLARE @stop datetime = dateadd(second,6,GETDATE())
DECLARE @trash int


WHILE @stop > GETDATE()
BEGIN
	BEGIN TRAN
	INSERT dbo.DLtest1 (junk)
	VALUES ('junk')

	SELECT @trash = ID
	FROM dbo.DLtest1
	WHERE ID = SCOPE_IDENTITY()
	COMMIT
END

But there was something weird still going on – many of the deadlock graphs (did I mention this batch produces different types of deadlocks? hurray for LINQ!) had RI-N involved. No matter what I modeled on the whiteboard, I couldn’t see how that was possible.

I finally managed to catch one by using an XE tracking both locks and waits, linking back to the table key by %%lockres%%.

Well, that was fun to read through. I had to map everything out on the whiteboard, and eventually figured out the source of the RI-N deadlock. It all starts while the second session is blocked on its INSERT. The first session releases its locks and continues to its next INSERT, but the value it gets from the cache leapfrogs the second session’s value. Eventually, the second session actually takes a second RI-N lock, holding onto the original RI-N lock all the while!

Blah blah blah, it would take a lot more words to explain. Here, have an animation instead, it’s probably a lot more helpful:

There you have it, another deadlock courtesy of freshly-minted MBAs, err, well-meaning developers. I’m still curious why the RI-N lock on ∞ isn’t released, but perhaps now’s the time to stop and let my readers disembark. Hope you enjoyed the ride!

Performance Impact of Small Queries

All performance statistics are lies, but some are useful. Yes, just some, and opinions differ about which ones. If you want to start a nerd fight, find a cluster of Senior DBAs at SQL Saturday and tell them how much you love Page Life Expectancy.

As a tuning aficionado, I happen to adore CPU metrics, such as those from sys.dm_exec_query_stats and Query Store. They tend to be extremely helpful in finding a server’s worst offending queries. But I see cases where a tiny SELECT hitting a million times had more effect on the server than Query Store CPU stats would suggest.

This makes sense – every query is going to have overhead, such as sending rows to the client, that probably isn’t captured in measurements. But can we prove it, and maybe get a sense of how much is missed?

Let’s create a small query, spam it against a server, and compare what the server says about its CPU to SQL Server’s metrics. Here’s the demo table, and a proc to seek a single row from it (designed in a way that doesn’t actually return any rows).

DROP TABLE IF EXISTS dbo.seektest
GO

CREATE TABLE dbo.seektest (
ID INT IDENTITY PRIMARY KEY,
junk CHAR(1)
)

INSERT dbo.seektest
SELECT TOP (5000000) 'a'
FROM master.dbo.spt_values a
CROSS JOIN master.dbo.spt_values b
GO

CREATE PROC dbo.SeekTesting1
(@ID int)
AS
BEGIN

DECLARE @b int

SELECT @b = ID
FROM dbo.SeekTest
WHERE ID = @ID

END

 

Now use a script to loop this pointless busy work (I’m channeling my inner junior developer here). We don’t want to use a program on the same server, since that will affect CPU. I’m lucky enough to have another large server available on the same network, where I run this PowerShell script adapted from Merrill Aldrich. (The original link to SQL Blog is dead, but you can find Google’s cached version when searching for “Quick and Dirty PowerShell SQL Server Load Test”)

$numWorkers = 32

$startTime = ( Get-Date ).AddSeconds(20)
$endTime = ( Get-Date ).AddSeconds(30)

$jobscript = {
    param( $startTime, $endTime, $conn )
         
    while ( ( Get-Date ) -lt $startTime ) { Start-Sleep -Milliseconds 100 }
    
    while ( ( Get-Date ) -lt $endTime ) { 
        $n = Get-Random -Maximum 5000001;
        Invoke-Sqlcmd -Query "EXEC dbo.SeekTesting1 $n" -ServerInstance "YourTestServer" -Database "TestDB";
        }
}

(1..$numWorkers) | foreach {
    Start-Job $jobscript -ArgumentList $startTime,$endTime,$conn
}

#Run below line for cleanup
#Get-Job | Remove-Job

Launching the script gives me a 10 second window where I see CPU climb from 1% to about 20%. Since my test rig has 48 cores, I calculate this as about 91s of added CPU activity (10 seconds × 48 cores × ~19% extra CPU ≈ 91 core-seconds). Then I compare to what exists in dmvs/Query Store.

SELECT st.text, qs.*
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE qs.query_hash = 0x0946FE740B52871B

Both of these show about 5s of CPU for the query – way less than 91s. By the way, you can reset stats/QDS with the below script between runs (you are using a test server, right?)

DBCC FREEPROCCACHE
GO

ALTER DATABASE TestDB SET QUERY_STORE CLEAR;

I pulled up PerfView to see where time was spent, and grabbed a flame graph. The boxed section is the actual query execution, and probably the only part that gets measured in SQL Server. The rest goes to things like parsing, checking security, and sending data. You know, fluff.

I ran multiple tests and variations, such as single-threaded execution and adjusting the proc to return data. Queries were consistent in taking an additional ~0.2ms of CPU overhead per execution, beyond what dmvs and QDS reported. Additional details for the curious: SQL Server 2016 SP2 Developer, 3.18GHz clock.

I imagine this overhead number is highly dependent on other factors, but it’s good enough for me to start estimating the actual costs of those tiny queries that get repeated a bajillion times. You might know them, the ones where developers loop through a couple hundred thousand values because of EF?

When I do the math, for a 32-core server, a 10% increase in CPU would need 57.6 million executions an hour, or 16k/sec. Generalized, a 10% CPU overhead needs 500 executions/sec/core. Despite a lot of assumptions, I’m happy to have a rule of thumb to start from. I hope you find the estimate useful too, but remember: this counts as a performance stat, and all performance stats are lies.

Round Robin Hunting

Simple does not equal boring. Should not. But I have a really bad habit of ignoring “simple” things I think I understand, only to find out later that some subtle interaction turns “boring” into “aggggghhhh!” This was the case with Round Robin Distribute Streams. Here’s its mugshot:

All it does is take rows from a single producer thread and distribute them in turn, or round robin, to multiple consumer threads. It looks something like this:

Simple, right? Nothing else to see, time to move on, etc? That’s what I thought, until I started seeing Round Robin pop up in a bizarre class of not-so-simple plans, plans involving intra-query parallel deadlocks.

Parallel deadlocks are a fascinating topic I hope to cover later, but let’s start just by reproducing a Round Robin. It turns out that hunting for this operator is an instructive journey in itself.

Usually parallelism occurs on large tables, where access to the base tables is itself parallel, with no need to distribute streams, like below:

Also, certain conditions need to be true for parallelism to occur. For example, no matter how large your table is, a simple SELECT * is going to be serial. Stuffing rows into network packets is done serially, and if there’s no other work to distribute, the whole plan avoids parallelism. Add a WHERE clause or change to a SELECT COUNT(*), and you can start to see it.
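Here’s a sketch of that difference, assuming any decently large table (dbo.Posts from StackOverflow2010, say):

--likely a serial plan: nothing to do but stream rows to the client
SELECT * FROM dbo.Posts

--now there's work to distribute, so parallelism becomes possible
SELECT COUNT(*) FROM dbo.Posts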

Maybe if the table is really small, SQL Server will distribute streams instead of parallel scanning. Let’s build a small table:

DROP TABLE IF EXISTS dbo.boringtable
GO

CREATE TABLE dbo.boringtable (
ID INT IDENTITY PRIMARY KEY,
yawn CHAR(100)
)

--1 row into the boring table (can change number of rows through TOP(N))
INSERT dbo.boringtable
SELECT TOP (1) REPLICATE('z',100)
FROM master..spt_values a
CROSS JOIN master..spt_values b
CROSS JOIN master..spt_values c

Now, let’s construct a parallel query, and since this is just a test, I’m going to cheat a little with trace flags. I use a WHERE clause so SQL Server sees there’s work to distribute, then I use 8649 to bully the optimizer into using parallelism and 8757 to skip a trivial plan (so that the optimizer can consider parallelism).

SELECT *
FROM dbo.boringtable
WHERE yawn LIKE 'z%'
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

Well that’s interesting. Even though the table has one row – and SQL Server knows it – the table is accessed in a parallel way, with just a single lucky thread getting to do anything.

So small tables don’t cause a Distribute Streams. Maybe we can use something that prevents the scan from being parallel. There happen to be numerous options, but my favorite one is the backwards scan. SQL Server can do an ascending scan (assuming that’s the natural index order) in parallel, but not descending. Weird, right? Also useful for demos. Let’s add an ORDER BY DESC.

SELECT *
FROM dbo.boringtable
WHERE yawn LIKE 'z%'
ORDER BY ID DESC
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

That didn’t quite work. Instead of a descending scan, we get a parallel ascending scan followed by a descending sort, because the cost is so low on a single-row table.

The next goal is to prevent the Sort. We can discourage it through costing, or use a TOP to force a separate region of the plan.

;WITH CTE AS (
SELECT TOP 1 *
FROM dbo.boringtable
ORDER BY ID DESC
)

SELECT *
FROM CTE
WHERE yawn LIKE 'z%'
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757 --actually unnecessary now
)

Success! Highly contrived, but still a success, and it’s actually similar to cases where I see Round Robin in the wild: a descending scan feeding a parallel section.

Say, what happens if we tell SQL Server this is a large table and then do a descending scan?

UPDATE STATISTICS dbo.BoringTable WITH PAGECOUNT = 10000, ROWCOUNT = 5000000
GO

SELECT TOP 10 *
FROM dbo.boringtable
WHERE yawn LIKE '%a%'
ORDER BY ID DESC
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

Huh…