Performance Impact of Small Queries

All performance statistics are lies, but some are useful. Yes, just some, and opinions differ about which ones. If you want to start a nerd fight, find a cluster of Senior DBAs at SQL Saturday and tell them how much you love Page Life Expectancy.

As a tuning aficionado, I happen to adore CPU metrics, such as those from sys.dm_exec_query_stats and Query Store. They tend to be extremely helpful in finding a server’s worst offending queries. But I’ve seen cases where a tiny SELECT executed a million times had more impact on the server than the Query Store CPU stats would suggest.

This makes sense – every query is going to have overhead, such as sending rows to the client, that probably isn’t captured in measurements. But can we prove it, and maybe get a sense of how much is missed?

Let’s create a small query, spam it against a server, and compare the CPU the OS reports against what SQL Server’s own metrics claim. Here’s the demo table, and a proc to seek a single row from it (designed so that it doesn’t actually return any rows).

DROP TABLE IF EXISTS dbo.seektest
GO

CREATE TABLE dbo.seektest (
	ID INT IDENTITY PRIMARY KEY,
	junk CHAR(1)
)

INSERT dbo.seektest
SELECT TOP (5000000) 'a'
FROM master.dbo.spt_values a
CROSS JOIN master.dbo.spt_values b
GO

CREATE PROC dbo.SeekTesting1
(@ID int)
AS
BEGIN

	DECLARE @b int

	SELECT @b = ID
	FROM dbo.seektest
	WHERE ID = @ID

END

 

Now we need a script to loop this pointless busywork (I’m channeling my inner junior developer here). We don’t want to run the load from the same server, since that would itself consume CPU on the box we’re measuring. I’m lucky enough to have another large server available on the same network, where I run this PowerShell script adapted from Merrill Aldrich. (The original link on SQLBlog is dead, but Google’s cached version turns up when searching for “Quick and Dirty PowerShell SQL Server Load Test”.)

$numWorkers = 32

# Shared start/stop times so all the workers fire together
$startTime = ( Get-Date ).AddSeconds(20)
$endTime   = ( Get-Date ).AddSeconds(30)

$jobscript = {
    param( $startTime, $endTime )

    # Idle until the shared start time, then spam the proc with random IDs
    while ( ( Get-Date ) -lt $startTime ) { Start-Sleep -Milliseconds 100 }

    while ( ( Get-Date ) -lt $endTime ) {
        $n = Get-Random -Maximum 5000001
        Invoke-Sqlcmd -Query "EXEC dbo.SeekTesting1 $n" -ServerInstance "YourTestServer" -Database "TestDB"
        }
}

(1..$numWorkers) | foreach {
    Start-Job $jobscript -ArgumentList $startTime,$endTime
}

#Run below line for cleanup once the jobs are done
#Get-Job | Remove-Job

Launching the script gives me a 10-second window where I see CPU climb from 1% to about 20%. Since my test rig has 48 cores, that works out to roughly 91 core-seconds of added CPU time (about 19% of 48 cores for 10 seconds). Then I compare that to what the DMVs and Query Store recorded.

SELECT st.text, qs.*
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE qs.query_hash = 0x0946FE740B52871B
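
If you’d rather pull the Query Store side with a query instead of the built-in reports, here’s a rough equivalent – a sketch that matches on statement text rather than the query hash (so expect a little noise from other statements touching the table), and remember that Query Store’s CPU columns are in microseconds:

--Query Store version: total CPU for statements referencing the test table
SELECT qt.query_sql_text,
SUM(rs.count_executions) AS executions,
SUM(rs.avg_cpu_time * rs.count_executions) / 1000000.0 AS total_cpu_s
FROM sys.query_store_query_text qt
JOIN sys.query_store_query q ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats rs ON rs.plan_id = p.plan_id
WHERE qt.query_sql_text LIKE '%dbo.seektest%'
GROUP BY qt.query_sql_text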

Both of these show about 5s of CPU for the query – way less than 91s. By the way, you can reset the stats/QDS with the below script between runs (you are using a test server, right?).

DBCC FREEPROCCACHE
GO

ALTER DATABASE TestDB SET QUERY_STORE CLEAR;

I pulled up PerfView to see where time was spent, and grabbed a flame graph. The boxed section is the actual query execution, and probably the only part that gets measured in SQL Server. The rest goes to things like parsing, checking security, and sending data. You know, fluff.

I ran multiple tests and variations, such as single-threaded execution and adjusting the proc to return data. Queries consistently took an additional ~0.2ms of CPU overhead per execution, beyond what the DMVs and QDS reported. Additional details for the curious: SQL Server 2016 SP2 Developer, 3.18GHz clock.

I imagine this overhead number is highly dependent on other factors, but it’s good enough for me to start estimating the actual costs of those tiny queries that get repeated a bajillion times. You might know them, the ones where developers loop through a couple hundred thousand values because of EF?

When I do the math, for a 32-core server, a 10% increase in CPU would need 57.6 million executions an hour, or 16k/sec. Generalized, a 10% CPU overhead needs 500 executions/sec/core. Despite a lot of assumptions, I’m happy to have a rule of thumb to start from. I hope you find the estimate useful too, but remember: this counts as a performance stat, and all performance stats are lies.
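
If you want to run the same back-of-envelope math against your own server, a sketch like this is a starting point – the 0.2ms figure is just my measurement from above, and total_worker_time is in microseconds:

--Reported CPU vs. estimated hidden overhead at ~0.2ms per execution
SELECT TOP (20) qs.execution_count,
qs.total_worker_time / 1000000.0 AS reported_cpu_s,
qs.execution_count * 0.0002 AS estimated_hidden_cpu_s
FROM sys.dm_exec_query_stats qs
ORDER BY qs.execution_count DESC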

Round Robin Hunting

Simple does not equal boring. Should not. But I have a really bad habit of ignoring “simple” things I think I understand, only to find out later that some subtle interaction turns “boring” into “aggggghhhh!” This was the case with Round Robin Distribute Streams. Here’s its mugshot:

All it does is take rows from a single producer thread and distribute them in turn, or round robin, to multiple consumer threads. It looks something like this:

Simple, right? Nothing else to see, time to move on, etc? That’s what I thought, until I started seeing Round Robin pop up in a bizarre class of not-so-simple plans, plans involving intra-query parallel deadlocks.

Parallel deadlocks are a fascinating topic I hope to cover later, but let’s start just by reproducing a Round Robin. It turns out that hunting for this operator is an instructive journey in itself.

Usually parallelism occurs on large tables, where access to the base table is itself parallel and there’s no need to distribute streams, like below:

Also, certain conditions need to be true for parallelism to occur. For example, no matter how large your table is, a simple SELECT * is going to be serial. Stuffing rows into network packets is done serially, and if there’s no other work to distribute, the whole plan avoids parallelism. Add a WHERE clause or change to a SELECT COUNT(*), and you can start to see it.
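
To make that concrete, here’s a quick pair of examples borrowing the five-million-row dbo.seektest table from the first post (any sufficiently large table works) – the first typically stays serial, while the second is a candidate for a parallel plan:

SELECT * FROM dbo.seektest --plain SELECT *: serial

SELECT COUNT(*) FROM dbo.seektest WHERE junk = 'b' --aggregate plus predicate: work worth distributing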

Maybe if the table is really small, SQL Server will distribute streams instead of parallel scanning. Let’s build a small table:

DROP TABLE IF EXISTS dbo.boringtable
GO

CREATE TABLE dbo.boringtable (
	ID INT IDENTITY PRIMARY KEY,
	yawn CHAR(100)
)
GO

--1 row into the boring table (can change number of rows through TOP(N))
INSERT dbo.boringtable
SELECT TOP (1) REPLICATE('z',100)
FROM master..spt_values a
CROSS JOIN master..spt_values b
CROSS JOIN master..spt_values c
Now, let’s construct a parallel query, and since this is just a test, I’m going to cheat a little with trace flags. I use a WHERE clause so SQL Server sees there’s work to distribute, then I use 8649 to bully the optimizer into using parallelism and 8757 to skip a trivial plan (so that the optimizer can consider parallelism).

SELECT *
FROM dbo.boringtable
WHERE yawn LIKE 'z%'
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

Well that’s interesting. Even though the table has one row – and SQL Server knows it – the table is accessed in a parallel way, with just a single lucky thread getting to do anything.

So small tables don’t cause a Distribute Streams. Maybe we can use something that prevents the scan from being parallel. There happen to be numerous options, but my favorite one is the backwards scan. SQL Server can do an ascending scan (assuming that’s the natural index order) in parallel, but not descending. Weird, right? Also useful for demos. Let’s add an ORDER BY DESC.

SELECT *
FROM dbo.boringtable
WHERE yawn LIKE 'z%'
ORDER BY ID DESC
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

That didn’t quite work. Instead of a descending scan, we get a parallel ascending scan followed by a descending sort, because the cost is so low on a single-row table.

The next goal is to prevent the Sort. We can discourage it through costing, or use a TOP to force it into a separate region of the plan.

;WITH CTE AS (
SELECT TOP 1 *
FROM dbo.boringtable
ORDER BY ID DESC
)

SELECT *
FROM CTE
WHERE yawn LIKE 'z%'
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757 --actually unnecessary now
)

Success! Highly contrived, but still a success, and it’s actually similar to cases where I see Round Robin in the wild: a descending scan feeding a parallel section.

Say, what happens if we tell SQL Server this is a large table and then do a descending scan?

UPDATE STATISTICS dbo.BoringTable WITH PAGECOUNT = 10000, ROWCOUNT = 5000000
GO

SELECT TOP 10 *
FROM dbo.boringtable
WHERE yawn LIKE '%a%'
ORDER BY ID DESC
OPTION(
QUERYTRACEON 8649
,QUERYTRACEON 8757
)

Huh…

Percentage Non Grata

Percentages? Junk. Yes, this right here:

It’s junk. It’s not the amount of time spent, it’s not the proportion of resources used, it’s not the fraction of rows, it’s junk. There’s a more nuanced explanation, but this is my pet peeve dangit, and I’m going on a repro rant.

Behold, two tables with a single row each.

DROP TABLE IF EXISTS dbo.A
DROP TABLE IF EXISTS dbo.B

CREATE TABLE dbo.A (ID int)
CREATE TABLE dbo.B (ID int)

INSERT dbo.A VALUES (1)
INSERT dbo.B VALUES (1)

Mess with the statistics, telling SQL Server there are 10,000 rows in B, and compare the SELECT *s.

UPDATE STATISTICS dbo.B WITH ROWCOUNT = 10000

SELECT * FROM A
SELECT * FROM B

Hey, that actual plan has different percentages even though the table scans take the same time with the same rows…

Recreate the tables to clear the bad stats. This query reads 1 row from A, and 0 from B. But they have the same percents!

SELECT TOP (1) *
FROM (
	SELECT * FROM A
	UNION ALL
	SELECT * FROM B
	) u

100%…each?!

Estimated plans can have problems too. Check this out.

IF (SELECT 1) = 1
SELECT * FROM A

261185% trustworthy

Those cost percentages, despite whatever myth you’ve heard, do not correspond to actual resource usage. What are they? Leftover guesses. SQL Server uses a cost-based optimizer, and those cost numbers come from the optimizer’s guesswork. Get an estimated plan, note the percents, then run with the actual plan, and you’ll see the exact same costs (with the usual “It Depends” edge-case disclaimer). They’re leftovers, not even reheated.
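
Don’t take my word for it – the only “cost” stored anywhere is the estimate sitting in the plan XML. Here’s a quick (and mildly expensive) peek at the cache to see those StatementSubTreeCost values for yourself:

WITH XMLNAMESPACES(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (10) qp.query_plan.value('(//StmtSimple/@StatementSubTreeCost)[1]','float') AS estimated_cost,
st.text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE qp.query_plan IS NOT NULL
ORDER BY estimated_cost DESC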

And pretty often, you’re looking at a plan because SQL Server guessed wrong. Which means those cost percentages are close to worthless when tuning.

So please, stop it. Percentage misunderstanding is too common, even among Senior DBAs. When we introduce people to execution plans, the best thing to say is that these costs are junk.

Debugging Dataflow

Sometimes I try to write useful, informative posts. This isn’t one of them. Unless, like me, you’re trying to get better at using windbg against SQL Server.

I’ve long known that “rows flow from right to left in an execution plan.” But what does it mean? What’s actually happening? I had my suspicions, but I wasn’t able to figure it out by just stepping through the code…there’s a lot of it. The solution finally came when I learned a new type of breakpoint that triggers after memory is accessed.

I start by inserting a row into a table. The pattern “AAAAAAAA” is really easy to spot in hex, so I make sure the row has that value (as an int).

DROP TABLE IF EXISTS dbo.PageTest
GO

CREATE TABLE dbo.PageTest(ID int PRIMARY KEY)

INSERT dbo.PageTest
VALUES (CONVERT(int,0xAAAAAAAA))

Then, grab the page location in memory using DBCC PAGE

DECLARE @db int = DB_ID()
DECLARE @page int
SELECT TOP (1) @page = allocated_page_page_id
FROM sys.dm_db_database_page_allocations(DB_ID(),OBJECT_ID('dbo.PageTest'),NULL,NULL,'DETAILED')
WHERE is_allocated = 1 
AND is_iam_page = 0

DBCC TRACEON(3604)
DBCC PAGE(@db,1,@page,1)

If I examine that memory location in windbg, I can see my row! In my case, the int value begins at 238d16e8064. Let’s set a breakpoint with “ba r1 <memory address>” and run a SELECT * against the table.

Examining the page in memory
Breakpoint on row read

Sure enough, I hit the breakpoint and…oh hey, is that another memory address? I wonder what’s in there after I let the line run (F11).

Hey, that’s my row!

So the row value gets copied to memory. Maybe this value in memory is how the row “gets passed” from operator to operator. Let’s prove it!

Run a query with an extra operator. I jump through some hoops to get a separate Filter (trace flag 9130 keeps the predicate out of the scan), because…uh, because I can. Maybe another plan would have been simpler. Whatever.

SELECT *
FROM dbo.PageTest WITH(FORCESCAN)
WHERE ID < 0
OPTION(QUERYTRACEON 9130)

This time I set a breakpoint on that new memory address the row is copied to. Let’s see if the Filter accesses it.

Success! This memory address is how the row is getting passed!

The last access of that address seems to be when row data is getting prepared for send-off (TDS being the clue here).

Hmm, I wonder what happens if I edit the memory value right after the filter looks at it? Hold my beer.

Editing memory
Time to file a support case

Ok, fine. Breaking SQL Server by editing memory isn’t actually that impressive. But it’s fun! And in the process I’ve come up with some questions and experiments that might be interesting to other (read: normal) people. Those exercises will have to wait for another day though, because windbg kept me up wayyyy too late.

Two Easy ASYNC_NETWORK_IO Demos

There are lots of ways to do dumb stuff with SQL Server. Fun, right? If you’re familiar with ASYNC_NETWORK_IO, then you’re likely aware of the dumbness that causes it.

Otherwise, here’s a quick rundown. Client programs ask for data from the database. If SQL Server has data ready to give, but the client isn’t asking for it yet, worker threads stand around tapping their feet, whistling the Jeopardy tune, and producing ASYNC_NETWORK_IO waits.

There are two common reasons for the client to be slower at receiving data than SQL Server is in providing it. The first is a SELECT * kind of situation, where the client is asking for a large (for certain values of large) amount of data. The other reason is when the client slowly processes data as it’s received, commonly row-by-row.

Sound like an “application issue”? Yeah, sure is. Which means you might need some extra proof for your well informed and highly talented developers. Here, have a PowerShell demo.

$qry ="SELECT TOP 10 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')), REPLICATE('a',8000) FROM master.dbo.spt_values"
Invoke-Sqlcmd -server "." -query $qry | ForEach{Start-Sleep -Seconds 1; $_}

Oh, and in case you don’t have your own, here’s a simple script to grab the waits that accumulate over a 20-second window. Make sure to start it before the PS script.

DROP TABLE IF EXISTS #waits

SELECT *
INTO #waits
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('LOGMGR_QUEUE','CHECKPOINT_QUEUE','LAZYWRITER_SLEEP','FT_IFTS_SCHEDULER_IDLE_WAIT','XE_DISPATCHER_WAIT',
'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','HADR_FILESTREAM_IOMGR_IOCOMPLETION','DIRTY_PAGE_POLL','SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP','BROKER_TO_FLUSH','SLEEP_TASK','SP_SERVER_DIAGNOSTICS_SLEEP','BROKER_TASK_STOP','HADR_WORK_QUEUE','HADR_LOGCAPTURE_WAIT',
'BROKER_TRANSMITTER','HADR_CLUSAPI_CALL','QDS_ASYNC_QUEUE','REDO_THREAD_PENDING_WORK','HADR_TIMER_TASK')

WAITFOR DELAY '00:00:20'

SELECT ws.wait_type, ws.wait_time_ms-w.wait_time_ms AS wait_ms, ws.waiting_tasks_count-w.waiting_tasks_count AS wait_count
FROM sys.dm_os_wait_stats ws
JOIN #waits w ON ws.wait_type = w.wait_type
WHERE ws.waiting_tasks_count > w.waiting_tasks_count
ORDER BY wait_ms DESC

Maybe you want something a little closer to the application code that’s actually causing the problem. Guess what, you can call .NET objects in PowerShell!

$conn = New-Object -TypeName System.Data.SqlClient.SqlConnection("Server = .;Integrated Security = True")
$qry = "SELECT TOP 1000 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')), REPLICATE('abcd',2000), REPLICATE('FROM',2000), REPLICATE('TSQL',2000) FROM master.dbo.spt_values"
$cmd = New-Object System.Data.SqlClient.SqlCommand

$conn.Open()

$cmd.CommandText = $qry
$cmd.Connection = $conn

$reader = $cmd.ExecuteReader()

while ($reader.Read()) {$reader[0];Start-Sleep -Milliseconds 5}

$conn.Close()

You probably noticed that my queries include large extra columns. If you run the script without them…

$qry ="SELECT TOP 10 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')) FROM master.dbo.spt_values"
Invoke-Sqlcmd -server "." -query $qry | ForEach{Start-Sleep -Seconds 1; $_}

…the ASYNC_NETWORK_IO waits disappear, even though the row-by-row processing remains! There’s a reason for this: SQL Server doesn’t actually send results row-at-a-time, but stuffs as many as it can into a packet before shipping it off. Without the fluff, all 10 integers arrive at once, and the script no longer leaves the database waiting.
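
Bonus: on 2016 and later you can also pin the wait on a specific session while the demo runs, no diff script required:

SELECT session_id, wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE wait_type = 'ASYNC_NETWORK_IO'
ORDER BY wait_time_ms DESC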

Now for a conversation with those skilled and deeply knowledgeable developers…

Slow Reports for Query Store

My first “Holy crap!” moment with Query Store was when I enabled it in production and the dang thing promptly cleared the plan cache (as designed, apparently). My second was seeing the next day’s CPU utilization nearly doubled due to spinlock contention (since fixed in a CU). Talk about side effects. In light of this, do you think I have Query Store enabled in production? Yes, yes I do, because it’s just so useful.

The Problem

My other big complaint is that often the built-in reports don’t work. You know, the ones double-clickable in SSMS, like “Top Resource Consuming Queries.” All I get is a “Waiting…” message that runs forever.

I once let it run for a couple hours with no luck. And killing the query somehow cleared the plan cache too! Holy crap #3. Thankfully I have an AG, and my good-enough workaround was to run the reports on a secondary while I pursued other problems.

Then, one day I had to check Query Store in a lower environment, and behold, I had the same slowness issue, but the report actually finished! Pent up annoyance boiled into action, and I pulled the query, ran it, snagged the actual plan, and caught the surprise culprit.

In all the slow variations I found, the pattern was the same. Repeated access to a table-valued function behind the sys.query_store views would consume huge amounts of CPU, most commonly alongside a Row Count Spool. It didn’t even matter if the TVF returned zero rows – it was the sheer number of accesses, averaging around 60ms each on my production system. Multiply that across hundreds of thousands of plans/queries being checked, and you have a recipe for slow.

The solution will be to set up plan guides for the SSMS queries that avoid repeated reads of the TVFs. But why stop here? [Actually, feel free to stop here. There’s nothing but gratuitous nerdery until the fix-it scripts] I have PerfView, and I know how to use it!

Fruitless Investigation

Well that’s interesting – pretty much all CPU went into CopyOutAllPlans. But why stop here? I have windbg, and I pretend to know how to use it!

I learned a new trick in windbg, and discovered that AdvanceToNextPlan was getting called hundreds of thousands of times per TVF execution – again, regardless of whether any rows were returned. My current hypothesis is that this TVF has to scan the entire memory allocation of its object, each and every time. Unfortunately, I wasn’t able to figure out which memory object it falls under, as DBCC FREESYSTEMCACHE(‘ALL’) did nothing to the in-memory QS data. I’m trying to improve my windbg skills enough to figure out memory information from debugging, but that may be a ways off.

Solution

It was a little tough getting a plan guide to work, due to the requirement that the query match character for character, so my solution is to pull the exact statement text from the request while the report is running. The fix is to:
1) Open up Top Resource Consuming Queries
2) Run the below script while the report is still spinning
3) Cancel the report and run it again
The process can be repeated for the reports you actually look at – just remember to use a new plan guide name each time!

--USE YourDB
DECLARE @badqry NVARCHAR(MAX)

SELECT TOP (1) @badqry = SUBSTRING(st.text, (r.statement_start_offset/2) + 1,  
    ((CASE statement_end_offset   
        WHEN -1 THEN DATALENGTH(st.text)  
        ELSE r.statement_end_offset 
	END - r.statement_start_offset)/2) + 1)
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE st.text LIKE '(@results_row_count int,@interval_start_time%'

IF @badqry IS NULL
RAISERROR('Missed the plan',16,1)

EXEC sp_create_plan_guide
@name = N'Fixing QueryStore Top Duration',
@stmt = @badqry,
@type = N'SQL',
@module_or_batch = NULL,
@params = N'@results_row_count int,@interval_start_time datetimeoffset(7),@interval_end_time datetimeoffset(7)',
@hints = N'OPTION (HASH JOIN, LOOP JOIN)'

I’m certain there’s a more efficient combination of hints than (HASH JOIN, LOOP JOIN) to prevent the problem, but these have worked well enough for me so far. Please let me know if you find something better.
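
To sanity-check the guide (or to start over with different hints), it shows up in sys.plan_guides and can be dropped with sp_control_plan_guide:

SELECT name, is_disabled, hints
FROM sys.plan_guides

--EXEC sp_control_plan_guide N'DROP', N'Fixing QueryStore Top Duration'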

There aren’t many folks using Query Store, or even 2016 yet, so I guess I’m paying the early-adopter tax (you’re welcome). But for all the frustrating issues I’ve encountered, I still see a genuinely useful product mixed in. And the “Holy crap!” moments keep pushing me to deepen my skills and write mildly entertaining blog posts wherein I pretend to use windbg.

The Three Physical Joins, Visualized

Less than a month into my first SQL job, I was handed a report that “suddenly stopped working.” I followed the Excel sheet to a data connection to a stored procedure to a SELECT to a view on a view on a view.
The difference between the query that worked and the one that never finished (at least, over the 10 hours I let it run) was a Hash versus a Loop join. It was my first introduction to query plans and to the importance of physical joins.

But what are they? Distinct from logical joins (INNER, OUTER, CROSS, etc.), physical joins are the different algorithms available to actually compare data from different tables. You don’t have to specify which one SQL Server will use, and usually SQL Server gets it right. But there’s a large gap between usually and always, and the fine art of query tuning lives in that gap. Having a good understanding of the joins is essential for mastering tuning (if mastery is even possible), so let’s take a look.

Nested Loops

Overview: The simplest algorithm is Nested Loops. Imagine you have two piles [tables] of values, and you want to find the matches. Take the first value from the first pile, and compare it to each value in the second pile. Then grab the next value from the first pile, and compare to everything in the second. Grab the third value…wash, rinse, repeat. If you write the pseudo-code for the algorithm, it looks something like below.

For Each value in pile1
    For Each value in pile2
        If pile1.value = pile2.value
            Return pile1.value, pile2.value

Hey, it’s one For-Each loop inside of another loop…you might even say, Nested inside of another. Ahem. Anyways, here’s a visualization.

Drawbacks: An important detail of Nested Loops to notice is that values in the bottom (inner) input get read multiple times, which can quickly grow out of control for large tables. This isn’t an issue if an index on that inner table allows seeking to the exact match, but otherwise the amount of work to compare rows increases drastically with size.

Merge Join

Overview: Merge Join allows comparison while reading each input value only once. The catch is that each input has to be sorted (and on the compared values). Roughly, start at the beginning of each input and progress through the values, grabbing the next input value from whichever is lower. I have pseudo-code for Merge, but it didn’t seem helpful, so I’m skipping straight to the picture.

Drawbacks: Remember how I said that Merge reads each input value only once? I lied. There’s a special (and not uncommon) case where that’s not true. If both inputs have repeated values, called Many-to-Many, the algorithm gets really complicated. As in, holy-crap-why-is-it-taking-so-long complicated. But if you can guarantee unique values in at least one input (like my gif above shows), you’re golden. So, Merge has two big gotchas: the many-to-many case, and the requirement for both inputs to be sorted.

Hash Match

Overview: The final algorithm is Hash Match, also known as Hash Join. Like Merge, it only needs a single pass through each input. It’s also the most complicated. Here’s how it works (with minor lies simplifications): there are two phases, called the build phase and probe phase. In the build phase, each value from the top input is hashed. The hash and the value are stored in a bucket. There are multiple buckets, and the hash value determines which bucket the hash and original value will be stored in. Then comes the probe phase. For each value in the second input, that value will be hashed, and the hash compared to hashes in the buckets. If the hash matches, then the values stored are checked to confirm the match. With an inner join, values that match are outputted, and otherwise discarded.

Below is my animated Hash Match, but first a note on my fake-for-learning-purposes-only hash function. I take each input number, spell it out, and take the first letter. For example, 1 becomes “one” becomes “o.” 5 becomes “five” becomes “f.”

Drawbacks: So why isn’t everything a Hash Join? Well, surprise surprise, it has its own weaknesses. The biggest one to consider is that “storing hashes and values in buckets” piece. Storing occurs in memory (hopefully), and memory is a limited resource. And, if the Hash Match needs more memory than it was given, it doesn’t get more, but writes (spills) to disk instead, killing performance.

I would probably summarize an introduction to the joins as below:

Join          Strengths                            Commonly Seen                               Blows Up When
Nested Loops  Simple, fast with the right indexes. Small joins with proper indexes.            Inner table has lots of rows repeatedly read.
Merge         Very fast with sorted inputs.        Key to key joins. Query includes ORDER BY.  Necessary sorts take too long. Many-to-Many.
Hash          Generally fast.                      Heaps. Large joins.                         Can’t get enough memory.
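
If you want to see all three shapes side by side, join hints will force each one. Here’s a quick illustration reusing the tiny dbo.A and dbo.B tables from the percentages rant above – demo purposes only, since hard-coding join hints in real code is rarely a good idea:

SELECT * FROM dbo.A a INNER LOOP JOIN dbo.B b ON a.ID = b.ID
SELECT * FROM dbo.A a INNER MERGE JOIN dbo.B b ON a.ID = b.ID
SELECT * FROM dbo.A a INNER HASH JOIN dbo.B b ON a.ID = b.ID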

The three algorithms each have different strengths, weaknesses, hidden gotchas, and bizarre variants. I’ve fallen in love with the subtleties and complexities that arise from deceptively simple starting points, and it pained me to have to skip over important details (such as Merge and Hash joins requiring an equality comparison). There’s really a lot more to write about, which I intend to do…eventually. In the meantime, if you want to read more, I suggest looking at Craig Freedman’s articles and Hugo Kornelis’ site.

So is NOLOCK Actually Faster?

NOLOCK has a reputation. Several reputations in fact. To the database snob, NOLOCK is a vestigial practice of the untrained masses, a Pavlovian reflex borne [yes, borne] of fossilized ignorance and apathy. Millions of man-hours of coding have gone into building robust ACID databases that guarantee consistency… to be casually thrown away by peons who can’t be bothered to actually fix their design.

For others, NOLOCK is the two-click salvation when the system is down, the company is losing thousands of dollars a minute, and the full weight of the corporate totem pole is pressing down. Incorrect results? Sure, in theory. Never seen one. NOLOCK works, so why not keep using it?

What about another reputation? That NOLOCK is the “go faster” hint? This one can be tested, and I love testing! There are three main ways it might work:
1) because it doesn’t get blocked
2) because it doesn’t have the overhead of taking locks
3) because it can use allocation order scans

Using NOLOCK to get around blocking is…I’m trying to think of a word here…dumb. I’m not talking about running an ad-hoc query on a busy production system where exact results don’t matter. I’m talking about adding the hint to EVERYTHING as a matter of practice. If your queries are slow from blocking, you need to fix how much data they’re reading, often through indexes, or move to RCSI. But others have written more on this.

Regarding allocation order scans, they help the most on severely fragmented tables – the ones with a GUID as clustering key for example. They also seem to help with physical reads. In fact, if you search for “NOLOCK performance testing,” the top result shows using the hint is faster, but the reason in that test is the different scan type. But when your data is in memory and not especially fragmented (you know, the way you’d expect when you’re not being bad at databases), the allocation scan doesn’t help as much.

So what about the second scenario? Ignoring blocking, ignoring allocation scans, does the locking overhead make a difference? Let’s find out!

Here’s my setup – four tables, wide and narrow, with and without a MAX column.

--narrow
CREATE TABLE dbo.narrow (
ID INT PRIMARY KEY,
junk CHAR(1))

--wide table, no MAX
CREATE TABLE dbo.wide (
ID INT PRIMARY KEY,
junk CHAR(7000))

--narrow, MAX
CREATE TABLE dbo.narrow_max (
ID INT PRIMARY KEY,
junk char(1),
bigjunk VARCHAR(MAX))

--wide, MAX
CREATE TABLE dbo.wide_max (
ID INT PRIMARY KEY,
junk CHAR(7000),
bigjunk VARCHAR(MAX))


INSERT dbo.narrow
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY(SELECT 'Obbish')), 'a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

INSERT dbo.wide
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY(SELECT 'Obbish')), 'a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

INSERT dbo.narrow_max
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY(SELECT 'Obbish')), 'a', 'a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

INSERT dbo.wide_max
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY(SELECT 'Obbish')), 'a','a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

I then run the following SELECT statements that scan the tables. The WHERE clause makes sure no rows get returned and sent over the network. I use a warm cache because that’s a bit more realistic for most environments.

SELECT *
FROM dbo.narrow
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.narrow WITH(NOLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.wide
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.wide WITH(NOLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.narrow_max
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.narrow_max WITH(NOLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.wide_max
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.wide_max WITH(NOLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

Average scan speeds (in ms) are as follows:

Table       Normal  NOLOCK
narrow          51      53
wide           794     525
narrow_max      53      52
wide_max      1433    1124

So yes, it looks like NOLOCK is faster, but only on wider tables in a scan situation. Seriously though, if your system is running a lot of scans on really wide tables, you might want to reconsider whether speeding them up is your real priority.

What about an OLTP-style test, with a lot of single-row seeks? I used SQL Query Stresser with 40 threads of 10-second WHILE loops running the below procs.

CREATE OR ALTER PROC dbo.seek_narrow
AS
BEGIN
	SELECT ID
	FROM dbo.narrow
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

CREATE OR ALTER PROC dbo.seek_narrow_NOLOCK
AS
BEGIN
	SELECT ID
	FROM dbo.narrow WITH (NOLOCK)
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO


CREATE OR ALTER PROC dbo.seek_wide
AS
BEGIN
	SELECT ID
	FROM dbo.wide
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

CREATE OR ALTER PROC dbo.seek_wide_NOLOCK
AS
BEGIN
	SELECT ID
	FROM dbo.wide WITH (NOLOCK)
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO


CREATE OR ALTER PROC dbo.seek_narrow_max
AS
BEGIN
	SELECT ID
	FROM dbo.narrow_max
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

CREATE OR ALTER PROC dbo.seek_narrow_max_NOLOCK
AS
BEGIN
	SELECT ID
	FROM dbo.narrow_max WITH (NOLOCK)
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO


CREATE OR ALTER PROC dbo.seek_wide_max
AS
BEGIN
	SELECT ID
	FROM dbo.wide_max
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

CREATE OR ALTER PROC dbo.seek_wide_max_NOLOCK
AS
BEGIN
	SELECT ID
	FROM dbo.wide_max WITH (NOLOCK)
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

And here’s what I used to grab execution stats:

SELECT OBJECT_NAME(object_id) AS [procedure],execution_count,total_worker_time
FROM sys.dm_exec_procedure_stats
WHERE object_id IN (OBJECT_ID('dbo.seek_narrow'),OBJECT_ID('dbo.seek_narrow_NOLOCK'),OBJECT_ID('dbo.seek_wide'),
OBJECT_ID('dbo.seek_wide_NOLOCK'),OBJECT_ID('dbo.seek_narrow_max'),OBJECT_ID('dbo.seek_narrow_max_NOLOCK'),
OBJECT_ID('dbo.seek_wide_max'),OBJECT_ID('dbo.seek_wide_max_NOLOCK'))
ORDER BY [procedure]

And here are my results, measured in executions, averaged over three runs.

Table       Normal      NOLOCK
narrow      11,059,073  10,530,684
wide        10,770,979  10,838,683
narrow_max  10,821,620  10,533,960
wide_max    10,443,677  10,035,070

The runs had enough variance to them that I’m not confident in saying a particular method is a clear loser, but it should be obvious that NOLOCK is no help here.

And overall I think NOLOCK is a loser. Sure, it’s faster in scans on wide tables, but maybe you should fix the scans and the wide tables instead of slapping on a bandaid. You know, you could use a TABLOCK hint for the same effect, without risking correctness.
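
For the scan case, that swap looks like this – same shape as the earlier tests, just trading dirty reads for a shared table lock:

SELECT *
FROM dbo.wide WITH (TABLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION (MAXDOP 1)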

There’s more I could test, and probably will. It turns out databases are complicated, or something. NOLOCK performance has long interested me – my only (as of this moment) dba.stackexchange question is about a case where NOLOCK is slower. In the end, though, I admit that I’m a database snob. Stop using NOLOCK, and fix your design!

Gotta Cache ‘Em All: Query Store Survey

I have a friend who’s really into Pokemon. Really into Pokemon. I had no idea there was such a thing as a “shiny” pokemon until he, uh, graciously spent a whole evening lecturing me on their value and capture.

It’s a passion I don’t really get, but have to tolerate, maybe even respect. It turns out I’m the same, just with different monsters. You see, I really like plan operators. I like knowing their behavior, I like learning their natural habitat, and I love hunting for exotics. If I could find a way to stuff and mount Flow Distinct on my wall, I probably would.

There are a handful of ways to see what operators roam your production environment. If you’re lucky enough to be working with 2016 or later (I’ve seen the stats though – condolences to those of you still on 2005), you can peek into Query Store. Actually, with how long the query takes to run, it might be more of an expedition than a peek. Caveat executor.

Here’s the script – again, mind that this can be an expensive query.

DROP TABLE IF EXISTS #QS_xml
GO

CREATE TABLE #QS_xml (
query_plan XML
)

INSERT #QS_xml
SELECT TRY_CONVERT(XML,query_plan) query_plan
FROM sys.query_store_plan qsp
GO


WITH XMLNAMESPACES(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan') 

,CTE AS (
SELECT x.value('./@PhysicalOp','varchar(200)') AS PhysicalOp, x.value('./@LogicalOp','varchar(200)') AS LogicalOp
FROM #QS_xml qsx
CROSS APPLY qsx.query_plan.nodes('//RelOp') c(x)
)

SELECT CTE.PhysicalOp, CTE.LogicalOp, COUNT(*) AS [Count]
FROM CTE
GROUP BY PhysicalOp, CTE.LogicalOp
ORDER BY [Count] DESC
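
And if you’re not on 2016 yet, roughly the same census works against the plan cache instead – same cost caveat applies:

WITH XMLNAMESPACES(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT x.value('./@PhysicalOp','varchar(200)') AS PhysicalOp,
x.value('./@LogicalOp','varchar(200)') AS LogicalOp,
COUNT(*) AS [Count]
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY qp.query_plan.nodes('//RelOp') c(x)
GROUP BY x.value('./@PhysicalOp','varchar(200)'), x.value('./@LogicalOp','varchar(200)')
ORDER BY [Count] DESC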

The most common operator for me, at a whopping quarter-million occurrences, is the Clustered Index Seek. That’s probably a good thing, since I do OLTP. I still see plenty of scans, because developers, and a decent range of not-rare-but-not-common operators, like the Table Spool (hmm, I should probably look into that query…).

Then there’s the esoteric bottom of the list, where operators are only seen a couple of places instead of a quarter-million. One of these was new to me: Inserted Scan, which comes from the Inserted pseudo-table in a trigger. I have a couple oddities like Parameter Table Scan and Hash Match for a Full Outer Join.

Of course, hiding down there are three shiny Flow Distincts. Time for me to hunt down a repro!

Cascading Deletes – Cardinality

Who knew cascading deletes could be so horribly entertaining (emphasis on horrible)? The problems I wrote about traced back to cardinality, but I wasn’t ready to write about it. So, uh, I might have spent most of a perfectly good weekend reading Dmitry and Microsoft patents.

The scan plan was coming from a really bad cardinality assumption: given a Parent ID sufficiently outside the stats histogram, *every* Child table row would join to it. But why was I getting such a ridiculous estimate?

Time for an undocumented trace flag (gosh, I love these)! Dmitry’s post introduced me to trace flag 2363, which outputs cardinality estimation info for the new estimator. I tossed in a QUERYTRACEON, and got the following.
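
For reference, the usage looks something like this – a sketch with hypothetical Parent/Child tables standing in for mine; 2363 is undocumented (test servers only), and 3604 routes the estimator chatter to the messages tab:

--Hypothetical repro tables; substitute your own parent/child pair
SELECT c.*
FROM dbo.Parent p
JOIN dbo.Child c ON c.ParentID = p.ID
WHERE p.ID = 999999999 --an ID well outside the Parent histogram
OPTION (QUERYTRACEON 2363, QUERYTRACEON 3604, RECOMPILE)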

Coooooooool! The optimizer considers stats on Parent.ID to be a badly-formed histogram, whatever that is. Moreover, it just gives up, moves to CSelCalcGuessComparisonJoin, and gives selectivity of 1 (aka everything). Nice guess.

What happens when the ID is within range?

Some huge amount of testing later, I’m able to determine that the bad bailout only occurs when 1) Parent stats are “badly-formed” and 2) the ID is out of range. Stats updated with a FULLSCAN aren’t badly-formed, which seems to be why the problem disappears in that circumstance.

Also interesting is what happens in 2014 compatibility with QO Fixes off.

It ejects, but uses CSelCalcSimpleJoin, with a much more reasonable guess.

I kind of have my answer at this point, which is that the reason for the crappy cardinality is buried deep within the proprietary bowels of SQL Server’s estimator design. Luckily there was a holiday sale on unsatisfying answers, and I got a new question for free.

What the heck is a “badly-formed histogram?”

Spelunking with windbg left me in the dark, so I resorted to empirical testing with stats donating. The best I can tell, a badly-formed histogram only occurs for a column that is unique and has decimal-valued range_rows. In turn, decimal-valued range_rows only seem to occur for large-enough tables with sampled statistics. How’s that for a summary of weeks of investigation?

To be honest, I’m not really happy with the state of my understanding here. My answers are more anecdote than explanation, only looking at a lower layer of turtle. Maybe if I booze up a Microsoft employee enough I can get them to share info outside of NDA…

Anyways, I’m off to read more patents.