Aggregate Operators, Visualized

Sometimes SQL Server has weird behavior that only makes sense when you learn what’s going on (though sometimes it just gets weirder). A recent post by an attentive blogger pointed out a query with GROUP BY returning ordered results, even without an ORDER BY. What’s going on here? The answer has to do with the aggregate operators in SQL Server.

Aggregates are probably one of the first topics you learned in your journey through SQL. They’re used everywhere – not just in relational databases. Open up just about any Excel sheet and you’ll find SUM(), MAX(), COUNT(), or the like. Under the hood, SQL Server uses two primary operators to actually calculate these, the Stream Aggregate and Hash Match Aggregate.

These two algorithms have strong similarities to the Merge Join and Hash Join algorithms (described here, with visualizations, which I recommend reading before continuing).

Let’s begin with an animated Stream Aggregate (using COUNT):

This is very similar to the Merge Join. The key detail is that it requires ordered input. Each successive value within a group gets aggregated, and as soon as a value from the next group shows up, the current result is emitted and the aggregation starts over.

Hash Match Aggregation works a little differently, as animated below:

Instead of requiring an ordered input, the Hash Aggregate builds a hash table, and stores aggregation details in memory. Of course, that assumes you have enough memory. Say hello to tempdb spilling if you run out. Hash Aggregate also happens to be a blocking operator – it can’t return any rows until it’s consumed all its input.

We end up with the same trade-offs we did with Merge and Hash joins. Stream Aggregate doesn’t block and doesn’t need memory, but needs an ordered input. Hash Aggregate can handle inputs in any order, but needs memory and blocks output until all rows are processed.

You can see both of these aggregates with a single table, using the sample script below. It loads a table with 10,000 rows, where both columns contain the same values, clustered on the first column.

CREATE TABLE #t (ID1 int, ID2 int)
CREATE CLUSTERED INDEX CX_#t ON #t(ID1)

INSERT #t
SELECT TOP (10000)
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0)),
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0))
FROM master..spt_values a
CROSS JOIN master..spt_values b

Run a GROUP BY query on ID1, the clustering column, and you get a Stream Aggregate.

A GROUP BY on ID2, which SQL Server can’t guarantee is ordered, will produce a Hash Aggregate plan.
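Concretely, the two queries might look like this (a minimal sketch using COUNT, though any aggregate shows the same operators):

```sql
-- ID1 is the clustering key, so rows arrive ordered: Stream Aggregate
SELECT ID1, COUNT(*) AS cnt
FROM #t
GROUP BY ID1

-- No ordering guarantee on ID2: Hash Match Aggregate
SELECT ID2, COUNT(*) AS cnt
FROM #t
GROUP BY ID2
```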

The results are no longer ordered for the Hash Aggregate plan either.

You can hopefully guess the answer to the ordering puzzle by now. The blogger was looking at a query where an existing index made a Stream Aggregate possible, and got the unexpectedly ordered results because of it. With different queries however, it’s easy to end up with unordered results as SQL Server chooses the Hash Match algorithm instead. No longer weird, but still pretty cool.

Grokking the Paul White Parallel Deadlock Demo

I like the word grok. I like it because people who remember its heyday (decades and decades ago) give me weird looks when I use it. I like it because it helps me find other Heinlein fans. Moreover, I like it because its assigned definition resonates with me. I don’t want to merely learn something; I crave a deeper understanding that internalizes lessons and shapes my intuition into expertise, perhaps even mastery. And that is grokking.

Parallel deadlocks are something I first learned about years ago, ehhh, let me change that to “kinda learned.” These suckers are weird. I wasn’t comfortable with them, but after diving into the under-the-hood mechanics of serial execution plans, I was finally ready to revisit how the strange parallel stuff actually works.

Now, the SQL Sensei of parallelism is Paul White, and you should read him if you want to learn X, for X in (parallelism, optimization, RCSI locking, and sundry related topics). Particularly, he has an MCVE of a parallel deadlock at 53:57 of this video. I watched that demo a while ago, but it wasn’t until recently that I grokked it. And that’s what I want to write about.

There are several pieces that have to come together to produce a parallel deadlock. Simplified, a deadlock is when two processes have a dependency on each other, each one saying “I’ll move after you do.”

The central problem is that it’s not obvious how to introduce inter-thread dependencies with parallelism. Paul’s demo relies on two key operators with dependencies. The first is Round Robin Distribute Streams, animated below (and with a prior blog post here).

Round Robin distributes rows in turn, so if one thread isn’t accepting any data, it’s possible to force all the others to wait.

The second key operator is Gather Streams, specifically order-preserving Gather Streams. Regular Gather Streams can pass along rows in any order, as below.

Ordered Gather Streams, by nature of its ordering, needs to grab the next value. And when one of the inputs is empty, there’s no way to know what should be “next,” and it is forced to wait.

We finally have enough to create an inter-thread dependency. We just need to gum up a plan with Round Robin and Ordered Gather Streams where each is waiting on the other to proceed. Simplified, we want something like below.

But SQL Server doesn’t choose parallelism without reason – there has to be some work it can distribute, even when using secret-sauce trace flag 8649. In this case, a Filter operator will suffice. Moreover, we can use it to filter out any rows on one of the threads, forcing the Gather Streams into a waiting state. We now have a list of requirements:

  1. Round Robin
  2. Ordered Gather Streams
  3. Filter that removes all rows on one thread

Getting #1, the Round Robin, takes a little effort. In this case, a backwards scan is the preferred technique, especially since the ordering is useful for getting #2, the Ordered Gather Streams. But that’s not the only piece to be aware of with Gather Streams. That operator has buffers to hold multiple rows, which helps prevent deadlocks. Getting Gather Streams stuck requires filling up those buffers, using rows of sufficient size or quantity.

Finally, the filtering for #3 looks easy enough with the modulus (%) operator, but there’s another gotcha: SQL Server likes to push predicates into the Scan, bypassing the use of a Filter operator at all. Thankfully, as your developers can attest, this performance optimization is easy to defeat. Just use MAX-length columns, which can’t be pushed down.

So here we are, after assembling all the pieces. This is Paul’s demo code:

SELECT MaxN = MAX(num.n)
FROM dbo.Numbers AS num
WHERE num.n < 30000
AND CONVERT(integer,CONVERT(varchar(max),num.n)) % 2 = 0
OPTION(
MAXDOP 2,
QUERYTRACEON 8649
)

Ordering comes from the MAX. The backwards scan is encouraged by < 30000. The double convert prevents predicate pushdown, and parallelism is from 8649.

Visualized, the deadlock might look like this:

After which the deadlock detector will eventually discover the problem, resolving it by writing rows to tempdb in a process I don’t understand yet. Backblog++

Knowing all the ingredients to this parallel deadlock can definitely help figuring out and fixing actual production parallel deadlocks, but if I’m honest, the genuine reason I studied this stuff is because I’m a nerd and I think it’s really stinkin’ cool. Whatever your reasons, I hope this was helpful to your own understanding. Happy grokking!

The Curious Case of the Insert Deadlock

My favorite deadlock story actually isn’t about SQL Server, but comes from the American Airlines pilot who taught me math. Slightly embellished, it goes like this.

In the distant past some freshly minted MBAs claimed they could save money by implementing a new scheduling system, and for some reason the company’s leadership listened to them. Several months later this pilot lands at his destination, but his gate is occupied by another plane and the control tower tells him to wait – its pilot is running late. Time passes with no progress, and he gets more and more frustrated about the delay. Determined to chew out the tardy pilot, he radios in again to find out the name of the offender holding him up. The pilot they were waiting on…was him!

This was a deadlock. He couldn’t dock his plane until the other one moved, and the other one couldn’t move until he docked and transferred. SQL Server deadlocks follow the same pattern of two (or more) processes needing the other to progress before they can continue, but only sometimes have angry passengers.

A lot of deadlocks are simple, but I found an unusual one recently – two INSERTs deadlocking – that forced me to dig deeper. As with most things horrible in SQL Server, it began with Entity Framework/LINQ, and was compounded by developer ignorance. As much as I hate LINQ for performance, it *is* a perpetual fountain of dumb, i.e., blog post material.

Here’s my recreation of the underlying table:

CREATE TABLE dbo.DLtest1 (ID INT identity, junk char(4)) 
CREATE CLUSTERED INDEX CX_DLtest1_ID ON dbo.DLtest1(ID)

And here’s what the query looked like:

INSERT dbo.DLtest1 (junk) VALUES ('junk') 

SELECT ID FROM dbo.DLtest1 
WHERE ID = SCOPE_IDENTITY()

Guess what isolation level this was running in? That’s right, the good ol’ EF TransactionScope() special: Serializable. Besides that, note that the SELECT doesn’t simply return SCOPE_IDENTITY(), but actually queries the table using it. Yeah…

Eagle-eyed readers might have noticed another problem: the clustered index wasn’t defined as unique. Making it unique is actually enough to fix the issue, and what I did in production. Functional code is boring though, so let’s keep investigating the deadlock as is.
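For reference, the boring production fix was a one-line index change. Against the repro table, it would be something like:

```sql
-- Rebuild the clustered index as unique; the IDENTITY values make that safe
CREATE UNIQUE CLUSTERED INDEX CX_DLtest1_ID ON dbo.DLtest1(ID)
WITH (DROP_EXISTING = ON)
```

With a unique index and an equality predicate, SQL Server knows at most one row can match the SELECT, and the range-locking behavior changes accordingly.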

The first mystery was the sequence of locks. Running an extended event session on the batch, I saw a LAST_MODE lock released…but it had never been acquired! Special thanks to Paul White for pointing out that the lock_acquired events are misleading. First, LAST_MODE is really RX-X, and second (mind-blowingly), SQL Server may request a certain lock mode, but the lock actually granted can be different!

A quick aside on range locks in SQL Server. We mostly see these under the Serializable isolation level. They have two components: the range and the key the lock is attached to. The range is the potential space between the key and the next lower key. I find it helps to imagine these on a number line. An RX-S lock on 5 would be an X lock on the range between 4 (assuming that’s the next lower value) and 5, with an S lock on the value of 5 itself.

Pretty pictures!

After Paul’s answer (and some XE wrangling), I was finally able to piece together the sequence of locks in an INSERT + SELECT. The INSERT begins with the RI-N lock (the I-lock preventing S and X on the range section, but with no lock on the key). This transitions to an X lock on the key when it is inserted. With the SELECT section, RS-S locks are taken on the key and the next higher key (in this case, usually infinity). But with the existing X lock on the key, the requested RS-S lock is instead granted as RX-X.
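Mapped back onto the batch itself, the sequence reads like this (a comment sketch of the lock timeline, not actual XE output):

```sql
BEGIN TRAN

INSERT dbo.DLtest1 (junk) VALUES ('junk')
-- 1. RI-N taken at the insert point: I-lock on the range, nothing on the key
-- 2. converts to an X lock on the key once the row is in place

SELECT ID FROM dbo.DLtest1 WHERE ID = SCOPE_IDENTITY()
-- 3. RS-S taken on the key and on the next higher key (usually infinity)
-- 4. on the key itself, the existing X means the RS-S request is granted as RX-X

COMMIT
```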

Easy, right? Who am I kidding, this took me days of poking and prodding. Here, have an animation:

Time to tackle the deadlock…here’s what I ran in two different sessions to reproduce it:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
SET NOCOUNT ON

DECLARE @stop datetime = dateadd(second,6,GETDATE())
DECLARE @trash int


WHILE @stop > GETDATE()
BEGIN
	BEGIN TRAN
	INSERT dbo.DLtest1 (junk)
	VALUES ('junk')

	SELECT @trash = ID
	FROM dbo.DLtest1
	WHERE ID = SCOPE_IDENTITY()
	COMMIT
END

But there was something weird still going on – many of the deadlock graphs (did I mention this batch produces different types of deadlocks? hurray for LINQ!) had RI-N involved. No matter what I modeled on the whiteboard, I couldn’t see how that was possible.

I finally managed to catch one by using an XE tracking both locks and waits, linking back to the table key by %%lockres%%.

Well, that was fun to read through. I had to map everything out on the whiteboard, and eventually figured out the source of the RI-N deadlock. It all starts while the second session is blocked on its INSERT. The first session releases its locks and continues to its next INSERT, but the value it gets from the cache leapfrogs the second session’s value. Eventually, the second session actually takes a second RI-N lock, holding onto the original RI-N lock all the while!

Blah blah blah, it would take a lot more words to explain. Here, have an animation instead, it’s probably a lot more helpful:

There you have it, another deadlock courtesy of freshly-minted MBAs, err, well-meaning developers. I’m still curious why the RI-N ∞ isn’t released, but perhaps now’s the time to stop and let my readers disembark. Hope you enjoyed the ride!

Performance Impact of Small Queries

All performance statistics are lies, but some are useful. Yes, just some, and opinions differ about which ones. If you want to start a nerd fight, find a cluster of Senior DBAs at SQL Saturday and tell them how much you love Page Life Expectancy.

As a tuning aficionado, I happen to adore CPU metrics, such as those in sys.dm_exec_query_stats and Query Store. They tend to be extremely helpful in finding a server’s worst offending queries. But I’ve seen cases where a tiny SELECT executed a million times had more effect on the server than the Query Store CPU stats would suggest.

This makes sense – every query is going to have overhead, such as sending rows to the client, that probably isn’t captured in measurements. But can we prove it, and maybe get a sense of how much is missed?

Let’s create a small query, spam it against a server, and compare what the server says about its CPU to SQL Server’s metrics. Here’s the demo table, and a proc to seek a single row from it (designed in a way that doesn’t actually return any rows).

DROP TABLE IF EXISTS dbo.seektest
GO

CREATE TABLE dbo.seektest (
ID INT IDENTITY PRIMARY KEY,
junk CHAR(1)
)

INSERT dbo.seektest
SELECT TOP (5000000) 'a'
FROM master.dbo.spt_values a
CROSS JOIN master.dbo.spt_values b
GO

CREATE PROC dbo.SeekTesting1
(@ID int)
AS
BEGIN

DECLARE @b int

SELECT @b = ID
FROM dbo.SeekTest
WHERE ID = @ID

END

Now use a script to loop this pointless busy work (I’m channeling my inner junior developer here). We don’t want to use a program on the same server, since that will affect CPU. I’m lucky enough to have another large server available on the same network, where I run this PowerShell script adapted from Merrill Aldrich. (The original link to SQL Blog is dead, but you can find Google’s cached version when searching for “Quick and Dirty PowerShell SQL Server Load Test”)

$numWorkers = 32

$startTime = ( Get-Date ).AddSeconds(20)
$endTime = ( Get-Date ).AddSeconds(30)

$jobscript = {
    param( $startTime, $endTime )

    while ( ( Get-Date ) -lt $startTime ) { Start-Sleep -Milliseconds 100 }

    while ( ( Get-Date ) -lt $endTime ) { 
        $n = Get-Random -Maximum 5000001;
        Invoke-Sqlcmd -Query "EXEC dbo.SeekTesting1 $n" -Server "YourTestServer" -Database "TestDB";
        }
}

(1..$numWorkers) | foreach {
    Start-Job $jobscript -ArgumentList $startTime,$endTime
}

#Run below line for cleanup
#Get-Job | Remove-Job

Launching the script gives me a 10 second window where I see CPU climb from 1% to about 20%. My test rig has 48 cores, so that works out to roughly 10 s × 48 cores × 19% ≈ 91 s of added CPU activity. Then I compare that to what shows up in the DMVs and Query Store.

SELECT st.text, qs.*
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE qs.query_hash = 0x0946FE740B52871B

Both of these show about 5s of CPU for the query – way less than 91s. By the way, you can reset the stats and Query Store between runs with the script below (you are using a test server, right?)

DBCC FREEPROCCACHE
GO

ALTER DATABASE TestDB SET QUERY_STORE CLEAR;

I pulled up PerfView to see where time was spent, and grabbed a flame graph. The boxed section is the actual query execution, and probably the only part that gets measured in SQL Server. The rest goes to things like parsing, checking security, and sending data. You know, fluff.

I ran multiple tests and variations, such as single-threaded execution and adjusting the proc to return data. Queries consistently took an additional ~0.2 ms of CPU per execution beyond what the DMVs and Query Store reported. Additional details for the curious: SQL Server 2016 SP2 Developer edition, 3.18GHz clock.

I imagine this overhead number is highly dependent on other factors, but it’s good enough for me to start estimating the actual costs of those tiny queries that get repeated a bajillion times. You might know them, the ones where developers loop through a couple hundred thousand values because of EF?

When I do the math, for a 32-core server, a 10% increase in CPU would need 57.6 million executions an hour, or 16k/sec. Generalized, a 10% CPU overhead needs 500 executions/sec/core. Despite a lot of assumptions, I’m happy to have a rule of thumb to start from. I hope you find the estimate useful too, but remember: this counts as a performance stat, and all performance stats are lies.
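The arithmetic behind that rule of thumb, spelled out (using the ~0.2 ms overhead figure from my tests):

```sql
DECLARE @cores int = 32
DECLARE @overhead_sec decimal(10,6) = 0.0002   -- ~0.2 ms unmeasured CPU per execution

-- A 10% CPU increase on this box = 3.2 CPU-seconds of work per second of wall clock
DECLARE @budget_cpu_sec decimal(10,4) = @cores * 0.10

SELECT
    @budget_cpu_sec / @overhead_sec          AS execs_per_sec,          -- 16,000
    @budget_cpu_sec / @overhead_sec * 3600   AS execs_per_hour,         -- 57,600,000
    @budget_cpu_sec / @overhead_sec / @cores AS execs_per_sec_per_core  -- 500
```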

Round Robin Hunting

Simple does not equal boring. Should not. But I have a really bad habit of ignoring “simple” things I think I understand, only to find out later that some subtle interaction turns “boring” into “aggggghhhh!” This was the case with Round Robin Distribute Streams. Here’s its mugshot:

All it does is take rows from a single producer thread and distribute them in turn, or round robin, to multiple consumer threads. It looks something like this:

Simple, right? Nothing else to see, time to move on, etc? That’s what I thought, until I started seeing Round Robin pop up in a bizarre class of not-so-simple plans, plans involving intra-query parallel deadlocks.

Parallel deadlocks are a fascinating topic I hope to cover later, but let’s start just by reproducing a Round Robin. It turns out that hunting for this operator is an instructive journey in itself.

Usually parallelism occurs on large tables, and access on the base tables is parallel without a need to distribute streams, like below:

Also, certain conditions need to be true for parallelism to occur. For example, no matter how large your table is, a simple SELECT * is going to be serial. Stuffing rows into network packets is done serially, and if there’s no other work to distribute, the whole plan avoids parallelism. Add a WHERE clause or change to a SELECT COUNT(*), and you can start to see it.

Maybe if the table is really small, SQL Server will distribute streams instead of parallel scanning. Let’s build a small table:

DROP TABLE IF EXISTS dbo.boringtable
GO

CREATE TABLE dbo.boringtable (
ID INT IDENTITY PRIMARY KEY,
yawn CHAR(100)
)

--1 row into the boring table (can change number of rows through TOP(N))
INSERT dbo.boringtable
SELECT TOP (1) REPLICATE('z',100)
FROM master..spt_values a
CROSS JOIN master..spt_values b
CROSS JOIN master..spt_values c

Now, let’s construct a parallel query, and since this is just a test, I’m going to cheat a little with trace flags. I use a WHERE clause so SQL Server sees there’s work to distribute, then I use 8649 to bully the optimizer into using parallelism and 8757 to skip a trivial plan (so that the optimizer can consider parallelism).

SELECT *
FROM dbo.boringtable
WHERE yawn LIKE 'z%'
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

Well that’s interesting. Even though the table has one row – and SQL Server knows it – the table is accessed in a parallel way, with just a single lucky thread getting to do anything.

So small tables don’t cause a Distribute Streams. Maybe we can use something that prevents the scan from being parallel. There happen to be numerous options, but my favorite one is the backwards scan. SQL Server can do an ascending scan (assuming that’s the natural index order) in parallel, but not descending. Weird, right? Also useful for demos. Let’s add an ORDER BY DESC.

SELECT *
FROM dbo.boringtable
WHERE yawn LIKE 'z%'
ORDER BY ID DESC
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

That didn’t quite work. Instead of a descending scan, we get a parallel ascending scan followed by a descending sort, because the cost is so low on a single-row table.

The next goal is to prevent the Sort. We can discourage it through costing, or use a TOP to force a separate region of the plan.

;WITH CTE AS (
SELECT TOP 1 *
FROM dbo.boringtable
ORDER BY ID DESC
)

SELECT *
FROM CTE
WHERE yawn LIKE 'z%'
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757 --actually unnecessary now
)

Success! Highly contrived, but still a success, and it’s actually similar to cases where I see Round Robin in the wild: a descending scan feeding a parallel section.

Say, what happens if we tell SQL Server this is a large table and then do a descending scan?

UPDATE STATISTICS dbo.BoringTable WITH PAGECOUNT = 10000, ROWCOUNT = 5000000
GO

SELECT TOP 10 *
FROM dbo.boringtable
WHERE yawn LIKE '%a%'
ORDER BY ID DESC
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

Huh…

Percentage Non Grata

Percentages? Junk. Yes, this right here:

It’s junk. It’s not the amount of time spent, it’s not the proportion of resources used, it’s not the fraction of rows, it’s junk. There’s a more nuanced explanation, but this is my pet peeve dangit, and I’m going on a repro rant.

Behold, two tables with a single row each.

DROP TABLE IF EXISTS dbo.A
DROP TABLE IF EXISTS dbo.B

CREATE TABLE dbo.A (ID int)
CREATE TABLE dbo.B (ID int)

INSERT dbo.A VALUES (1)
INSERT dbo.B VALUES (1)

Mess with statistics, saying there are 10000 rows in B, and compare the SELECT *s.

UPDATE STATISTICS dbo.B WITH ROWCOUNT = 10000

SELECT * FROM A
SELECT * FROM B

Hey, that actual plan has different percentages even though the table scans take the same time with the same rows…

Recreate the tables to clear the bad stats. This query reads 1 row from A, and 0 from B. But they have the same percents!

SELECT TOP (1) *
FROM (
	SELECT * FROM A
	UNION ALL
	SELECT * FROM B
	) u

100%…each?!

Estimated plans can have problems too. Check this out.

IF (SELECT 1) = 1
SELECT * FROM A

261185% trustworthy

Those cost percentages, despite whatever myth you’ve heard, do not correspond to actual resource usage. What are they? Leftover guesses. SQL Server uses a cost-based optimizer, and those cost numbers come from the optimizer’s guesswork. Get an estimated plan, note the percents, then run with the actual plan, and you’ll see the exact same costs (with usual “It Depends” edge-case disclaimer). They’re leftovers, not even reheated.

And pretty often, you’re looking at a plan because SQL Server guessed wrong. Which means those cost percentages are close to worthless when tuning.

So please, stop it. Percentage misunderstanding is too common, even among Senior DBAs. When we introduce people to execution plans, the best thing to say is that these costs are junk.

Debugging Dataflow

Sometimes I try to write useful, informative posts. This isn’t one of them. Unless, like me, you’re trying to get better at windbg in SQL Server.

I’ve long known that “rows flow from right to left in an execution plan.” But what does it mean? What’s actually happening? I had my suspicions, but I wasn’t able to figure it out by just stepping through the code…there’s a lot of it. The solution finally came when I learned a new type of breakpoint that triggers after memory is accessed.

I start by inserting a row into a table. The pattern “AAAAAAAA” is really easy to spot in hex, so I make sure the row has that value (as an int).

DROP TABLE IF EXISTS dbo.PageTest
GO

CREATE TABLE dbo.PageTest(ID int PRIMARY KEY)

INSERT dbo.PageTest
VALUES (CONVERT(int,0xAAAAAAAA))

Then, grab the page location in memory using DBCC PAGE

DECLARE @db int = DB_ID()
DECLARE @page int
SELECT TOP (1) @page = allocated_page_page_id
FROM sys.dm_db_database_page_allocations(DB_ID(),OBJECT_ID('dbo.PageTest'),NULL,NULL,'DETAILED')
WHERE is_allocated = 1 
AND is_iam_page = 0

DBCC TRACEON(3604)
DBCC PAGE(@db,1,@page,1)

If I examine that memory location in windbg, I can see my row! In my case, the int value begins at 238d16e8064. Let’s set a breakpoint with “ba r1 <memory address>” and run a SELECT * against the table.

Examining the page in memory
Breakpoint on row read

Sure enough, I hit the breakpoint and…oh hey, is that another memory address? I wonder what’s in there after I let the line run (F11).

Hey, that’s my row!

So the row value gets copied to memory. Maybe this value in memory is how the row “gets passed” operator to operator. Let’s prove it!

Run a query with an extra operator. I jump through some hoops to get a separate Filter, because…uh, because I can. Maybe another plan would have been simpler. Whatever.

SELECT *
FROM dbo.PageTest WITH(FORCESCAN)
WHERE ID < 0
OPTION(QUERYTRACEON 9130)

This time I set a breakpoint on that new memory address the row is copied to. Let’s see if the Filter accesses it.

Success! This memory address is how the row is getting passed!

The last access of that address seems to be when row data is getting prepared for send-off (TDS being the clue here).

Hmm, I wonder what happens if I edit the memory value right after the filter looks at it? Hold my beer.

Editing memory
Time to file a support case

Ok, fine. Breaking SQL Server by editing memory isn’t actually that impressive. But it’s fun! And in the process I’ve come up with some questions and experiments that might be interesting to other (read: normal) people. Those exercises will have to wait for another day though, because windbg kept me up wayyyy too late.

Two Easy ASYNC_NETWORK_IO Demos

There are lots of ways to do dumb stuff with SQL Server. Fun, right? If you’re familiar with ASYNC_NETWORK_IO, then you’re likely aware of the dumbness that causes it.

Otherwise, here’s a quick rundown. Client programs ask for data from the database. If SQL Server has data ready to give, but the client isn’t asking for it yet, worker threads stand around tapping their feet, whistling the Jeopardy tune, and producing ASYNC_NETWORK_IO waits.

There are two common reasons for the client to be slower at receiving data than SQL Server is in providing it. The first is a SELECT * kind of situation, where the client is asking for a large (for certain values of large) amount of data. The other reason is when the client slowly processes data as it’s received, commonly row-by-row.

Sound like an “application issue”? Yeah, sure is. Which means you might need some extra proof for your well informed and highly talented developers. Here, have a PowerShell demo.

$qry ="SELECT TOP 10 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')), REPLICATE('a',8000) FROM master.dbo.spt_values"
Invoke-Sqlcmd -server "." -query $qry | ForEach{Start-Sleep -Seconds 1; $_}

Oh, and in case you don’t have your own, here’s a simple script to capture waits over a 20 second window. Make sure to start it before the PS script.

DROP TABLE IF EXISTS #waits

SELECT *
INTO #waits
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('LOGMGR_QUEUE','CHECKPOINT_QUEUE','LAZYWRITER_SLEEP','FT_IFTS_SCHEDULER_IDLE_WAIT','XE_DISPATCHER_WAIT',
'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','HADR_FILESTREAM_IOMGR_IOCOMPLETION','DIRTY_PAGE_POLL','SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP','BROKER_TO_FLUSH','SLEEP_TASK','SP_SERVER_DIAGNOSTICS_SLEEP','BROKER_TASK_STOP','HADR_WORK_QUEUE','HADR_LOGCAPTURE_WAIT',
'BROKER_TRANSMITTER','HADR_CLUSAPI_CALL','QDS_ASYNC_QUEUE','REDO_THREAD_PENDING_WORK','HADR_TIMER_TASK')

WAITFOR DELAY '00:00:20'

SELECT ws.wait_type, ws.wait_time_ms-w.wait_time_ms AS wait_ms, ws.waiting_tasks_count-w.waiting_tasks_count AS wait_count
FROM sys.dm_os_wait_stats ws
JOIN #waits w ON ws.wait_type = w.wait_type
WHERE ws.waiting_tasks_count > w.waiting_tasks_count
ORDER BY wait_ms DESC

Maybe you want something a little closer to the application code that’s actually causing the problem. Guess what, you can call .NET objects in PowerShell!

$conn = New-Object -TypeName System.Data.SqlClient.SqlConnection("Server = .;Integrated Security = True")
$qry = "SELECT TOP 1000 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')), REPLICATE('abcd',2000), REPLICATE('FROM',2000), REPLICATE('TSQL',2000) FROM master.dbo.spt_values"
$cmd = New-Object System.Data.SqlClient.SqlCommand

$conn.Open()

$cmd.CommandText = $qry
$cmd.Connection = $conn

$reader = $cmd.ExecuteReader()

while ($reader.Read()) {$reader[0];Start-Sleep -Milliseconds 5}

$conn.Close()

You probably noticed that my queries include large extra columns. If you run the script without them…

$qry ="SELECT TOP 10 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')) FROM master.dbo.spt_values"
Invoke-Sqlcmd -server "." -query $qry | ForEach{Start-Sleep -Seconds 1; $_}

…the ASYNC_NETWORK_IO waits disappear, even though the row-by-row processing remains! There’s a reason for this: SQL Server doesn’t actually send results row-at-a-time, but stuffs as many as it can into a packet before shipping it off. Without the fluff, all 10 integers arrive at once, and the script no longer leaves the database waiting.

Now for a conversation with those skilled and deeply knowledgeable developers…

Slow Reports for Query Store

My first “Holy crap!” moment with Query Store was when I enabled it in production and the dang thing promptly cleared the plan cache (as designed, apparently). My second was seeing the next day’s CPU utilization nearly doubled due to spinlock contention (since fixed in a CU). Talk about side effects. In light of this, do you think I have Query Store enabled in production? Yes, yes I do, because it’s just so useful.

The Problem

My other big complaint is that the built-in reports often don’t work. You know, the ones double-clickable in SSMS, like “Top Resource Consuming Queries.” All I get is a “Waiting…” message that runs forever.

I once let it run for a couple hours with no luck. And killing the query somehow cleared the plan cache too! Holy crap #3. Thankfully I have an AG, and my good-enough workaround was to run the reports on a secondary while I pursued other problems.

Then, one day I had to check Query Store in a lower environment, and behold, I had the same slowness issue, but the report actually finished! Pent up annoyance boiled into action, and I pulled the query, ran it, snagged the actual plan, and caught the surprise culprit.

In all the slow variations I found, the pattern was the same. Repeated access to a TVF underlying the sys.query_store DMVs would consume huge amounts of CPU, most commonly alongside a Row Count Spool. It didn’t even matter if the TVF returned zero rows – what mattered was the number of accesses, averaging around 60ms each on my production system. Multiply that across hundreds of thousands of plans/queries being checked, and you have a recipe for slow.

The solution will be to set up plan guides for the SSMS queries that avoid repeated reads of the TVFs. But why stop here? [Actually, feel free to stop here. There’s nothing but gratuitous nerdery until the fix-it scripts] I have PerfView, and I know how to use it!

Fruitless Investigation

Well that’s interesting – pretty much all CPU went into CopyOutAllPlans. But why stop here? I have windbg, and I pretend to know how to use it!

I learned a new trick in windbg, and discovered that AdvanceToNextPlan was getting called hundreds of thousands of times per TVF execution – again, regardless of whether any rows were returned. My current hypothesis is that this TVF has to scan the entire memory allocation of its object, each and every time. Unfortunately, I wasn’t able to figure out which memory object it falls under, as DBCC FREESYSTEMCACHE('ALL') did nothing to the in-memory QS data. I’m trying to improve my windbg skills enough to pull memory information out of the debugger, but that may be a ways off.

Solution

It was a little tough getting a plan guide to work, due to the requirement that the query match character for character, so my solution is to pull the exact text from the plan while it’s running. The fix is to:
1) Open up the Top Resource Consuming Queries report
2) Run the below script while the report is still spinning
3) Cancel the report and run it again
The process can be repeated for the other reports you actually use – just remember to use a new plan guide name each time!

--USE YourDB
DECLARE @badqry NVARCHAR(MAX);

SELECT TOP (1) @badqry = SUBSTRING(st.text, (r.statement_start_offset/2) + 1,
    ((CASE r.statement_end_offset
        WHEN -1 THEN DATALENGTH(st.text)
        ELSE r.statement_end_offset
      END - r.statement_start_offset)/2) + 1)
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE st.text LIKE '(@results_row_count int,@interval_start_time%';

IF @badqry IS NULL
    RAISERROR('Missed the plan', 16, 1);

EXEC sp_create_plan_guide
    @name = N'Fixing QueryStore Top Duration',
    @stmt = @badqry,
    @type = N'SQL',
    @module_or_batch = NULL,
    @params = N'@results_row_count int,@interval_start_time datetimeoffset(7),@interval_end_time datetimeoffset(7)',
    @hints = N'OPTION (HASH JOIN, LOOP JOIN)';

I’m certain there’s a more efficient combination of hints than (HASH JOIN, LOOP JOIN) to prevent the problem, but these have worked well enough for me so far. Please let me know if you find something better.

There aren’t many folks using Query Store, or even 2016 yet, so I guess I’m paying the early-adopter tax (you’re welcome). But for all the frustrating issues I’ve encountered, I still see a genuinely useful product mixed in. And the “Holy crap!” moments keep pushing me to deepen my skills and write mildly entertaining blog posts wherein I pretend to use windbg.

The Three Physical Joins, Visualized

Less than a month into my first SQL job, I was handed a report that “suddenly stopped working.” I followed the Excel sheet to a data connection to a stored procedure to a SELECT to a view on a view on a view.

The difference between the query that worked and the one that never finished (at least, not in the 10 hours I let it run) was a Hash versus a Loop join. It was my first introduction to query plans and to the importance of physical joins.

But what are they? Distinct from logical joins (INNER, OUTER, CROSS, etc.), physical joins are the different algorithms available to actually compare data from different tables. You don’t have to specify which one SQL Server will use, and usually SQL Server gets it right. But there’s a large gap between usually and always, and the fine art of query tuning lives in that gap. Having a good understanding of the joins is essential for mastering tuning (if mastery is even possible), so let’s take a look.

Nested Loops

Overview: The simplest algorithm is Nested Loops. Imagine you have two piles [tables] of values, and you want to find the matches. Take the first value from the first pile, and compare it to each value in the second pile. Then grab the next value from the first pile, and compare to everything in the second. Grab the third value…wash, rinse, repeat. If you write the pseudo-code for the algorithm, it looks something like below.

For Each value in pile1
    For Each value in pile2
        If pile1.value = pile2.value
            Return pile1.value, pile2.value

Hey, it’s one For-Each loop inside of another loop…you might even say, Nested inside of another. Ahem. Anyways, here’s a visualization.

Drawbacks: An important detail of Nested Loops to notice is that values in the bottom (inner) input get read multiple times, which can quickly grow out of control for large tables. This isn’t an issue if an index on that inner table allows seeking to the exact match, but otherwise the amount of work to compare rows increases drastically with size.
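To make the pseudo-code concrete, here's a minimal runnable sketch in Python. This is for illustration only – SQL Server's actual operator is far more sophisticated (and can seek an index on the inner side instead of scanning it):

```python
def nested_loops_join(outer, inner):
    """Inner join via nested loops: every outer value is compared
    against every inner value, so `inner` is read len(outer) times."""
    matches = []
    for o in outer:
        for i in inner:
            if o == i:
                matches.append((o, i))
    return matches

print(nested_loops_join([1, 2, 3], [2, 3, 3, 4]))  # [(2, 2), (3, 3), (3, 3)]
```

Note the cost: for inputs of size N and M, that's N×M comparisons – exactly why the inner side blowing up in size hurts so much.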

Merge Join

Overview: Merge Join allows comparison while reading each input value only once. The catch is that each input has to be sorted, and sorted on the compared values. Roughly: start at the beginning of each input, progress through the values, and grab the next value from whichever side is lower. I have pseudo-code for Merge, but it didn’t seem helpful, so I’m skipping straight to the picture.

Drawbacks: Remember how I said that Merge reads each input value only once? I lied. There’s a special (and not uncommon) case where that’s not true. If both inputs have repeated values, called Many-to-Many, the algorithm gets really complicated. As in, holy-crap-why-is-it-taking-so-long complicated. But if you can guarantee unique values in at least one input (like my gif above shows), you’re golden. So, Merge has two big gotchas: the many-to-many case, and the requirement for both inputs to be sorted.
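For the simple one-to-many case – unique values in the first input, both inputs sorted – a Merge Join sketch in Python looks like this (illustrative only, not SQL Server's implementation):

```python
def merge_join(left, right):
    """Inner join of two sorted inputs, assuming `left` has unique
    values (one-to-many). Each input is read only once."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1          # advance whichever side has the lower value
        elif left[i] > right[j]:
            j += 1
        else:
            out.append((left[i], right[j]))
            j += 1          # left is unique, so only right advances on a match
    return out

print(merge_join([1, 2, 3, 5], [2, 2, 3, 6]))  # [(2, 2), (2, 2), (3, 3)]
```

The many-to-many case is where this falls apart: with duplicates on both sides, the algorithm has to rewind the second input for each duplicate in the first, and the single-pass guarantee is gone.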

Hash Match

Overview: The final algorithm is Hash Match, also known as Hash Join. Like Merge, it only needs a single pass through each input. It’s also the most complicated. Here’s how it works (with minor lies, er, simplifications): there are two phases, called the build phase and the probe phase. In the build phase, each value from the top input is hashed, and the hash and the value are stored in a bucket. There are multiple buckets, and the hash value determines which bucket a given hash and original value land in. Then comes the probe phase. Each value in the second input is hashed, and its hash is compared to the hashes in the buckets. If the hash matches, the stored values are checked to confirm the match. With an inner join, matching values are output and everything else is discarded.

Below is my animated Hash Match, but first a note on my fake-for-learning-purposes-only hash function. I take each input number, spell it out, and take the first letter. For example, 1 becomes “one” becomes “o.” 5 becomes “five” becomes “f.”

Drawbacks: So why isn’t everything a Hash Join? Well, surprise surprise, it has its own weaknesses. The biggest one to consider is that “storing hashes and values in buckets” piece. Storing occurs in memory (hopefully), and memory is a limited resource. And, if the Hash Match needs more memory than it was given, it doesn’t get more, but writes (spills) to disk instead, killing performance.
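Here's the build/probe flow as a Python sketch, using the same fake-for-learning-purposes-only hash function as the animation (first letter of the spelled-out number). Real hash functions, and SQL Server's real operator, are of course far more involved:

```python
def fake_hash(n):
    """The post's toy hash: spell the number, take the first letter.
    Only handles 1-9, which is all the example needs."""
    words = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
             6: "six", 7: "seven", 8: "eight", 9: "nine"}
    return words[n][0]

def hash_join(build_input, probe_input):
    # Build phase: hash each value from the first input into a bucket.
    buckets = {}
    for v in build_input:
        buckets.setdefault(fake_hash(v), []).append(v)
    # Probe phase: hash each value from the second input; on a bucket
    # hit, compare the stored values to confirm a real match.
    out = []
    for v in probe_input:
        for stored in buckets.get(fake_hash(v), []):
            if stored == v:
                out.append((stored, v))
    return out

# 4 ("four") and 5 ("five") collide in bucket "f" but still compare unequal.
print(hash_join([1, 4, 5], [4, 2, 5]))  # [(4, 4), (5, 5)]
```

The collision between 4 and 5 shows why the probe phase must re-check the stored values: a matching hash is only a candidate, not a match.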

I would probably summarize an introduction to the three joins as below:

Join         | Strengths                            | Commonly Seen                              | Blows Up When
Nested Loops | Simple, fast with the right indexes. | Small joins with proper indexes.           | Inner table has lots of rows repeatedly read.
Merge        | Very fast with sorted inputs.        | Key-to-key joins. Query includes ORDER BY. | Necessary sorts take too long. Many-to-many.
Hash         | Generally fast.                      | Heaps. Large joins.                        | Can’t get enough memory.

The three algorithms each have different strengths, weaknesses, hidden gotchas, and bizarre variants. I’ve fallen in love with the subtleties and complexities that arise from deceptively simple starting points, and it pained me to have to skip over important details (such as Merge and Hash joins requiring an equality comparison). There’s really a lot more to write about, which I intend to do…eventually. In the meantime, if you want to read more, I suggest looking at Craig Freedman’s articles and Hugo Kornelis’ site.