Percentage Non Grata

Percentages? Junk. Yes, this right here:

It’s junk. It’s not the amount of time spent, it’s not the proportion of resources used, it’s not the fraction of rows, it’s junk. There’s a more nuanced explanation, but this is my pet peeve dangit, and I’m going on a repro rant.

Behold, two tables with a single row each.

DROP TABLE IF EXISTS dbo.A
DROP TABLE IF EXISTS dbo.B

CREATE TABLE dbo.A (ID int)
CREATE TABLE dbo.B (ID int)

INSERT dbo.A VALUES (1)
INSERT dbo.B VALUES (1)

Mess with statistics, saying there are 10000 rows in B, and compare the SELECT *s.

UPDATE STATISTICS dbo.B WITH ROWCOUNT = 10000

SELECT * FROM A
SELECT * FROM B

Hey, that actual plan has different percentages even though the table scans take the same time with the same rows…

Recreate the tables to clear the bad stats. This query reads 1 row from A, and 0 from B. But they have the same percents!

SELECT TOP (1) *
FROM (
	SELECT * FROM A
	UNION ALL
	SELECT * FROM B
	) u
100%…each?!

Estimated plans can have problems too. Check this out.

IF (SELECT 1) = 1
SELECT * FROM A
261185% trustworthy

Those cost percentages, despite whatever myth you’ve heard, do not correspond to actual resource usage. What are they? Leftover guesses. SQL Server uses a cost-based optimizer, and those cost numbers come from the optimizer’s guesswork. Get an estimated plan, note the percents, then run with the actual plan, and you’ll see the exact same costs (with the usual “It Depends” edge-case disclaimer). They’re leftovers, not even reheated.
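
If you want to see the raw numbers behind those percentages, they live in the plan XML as estimated costs on each operator (EstimatedTotalSubtreeCost and friends). Here’s a minimal sketch against the plan cache, assuming you can read the plan DMVs – the values are unitless optimizer estimates, nothing more.

--A sketch: pull the optimizer's estimated subtree costs (the numbers behind the percentage labels) out of cached plans
WITH XMLNAMESPACES(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
	x.value('./@PhysicalOp','varchar(200)') AS PhysicalOp,
	x.value('./@EstimatedTotalSubtreeCost','float') AS EstSubtreeCost
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY qp.query_plan.nodes('//RelOp') c(x)
WHERE qp.query_plan IS NOT NULL
ORDER BY EstSubtreeCost DESC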

And pretty often, you’re looking at a plan because SQL Server guessed wrong. Which means those cost percentages are close to worthless when tuning.

So please, stop it. Percentage misunderstanding is too common, even among Senior DBAs. When we introduce people to execution plans, the best thing to say is that these costs are junk.

Debugging Dataflow

Sometimes I try to write useful, informative posts. This isn’t one of them. Unless, like me, you’re trying to get better at using windbg with SQL Server.

I’ve long known that “rows flow from right to left in an execution plan.” But what does that mean? What’s actually happening? I had my suspicions, but I wasn’t able to figure it out by just stepping through the code…there’s a lot of it. The breakthrough finally came when I learned about a new type of breakpoint, one that fires when a memory address is accessed.

I start by inserting a row into a table. The pattern “AAAAAAAA” is really easy to spot in hex, so I make sure the row has that value (as an int).

DROP TABLE IF EXISTS dbo.PageTest
GO

CREATE TABLE dbo.PageTest(ID int PRIMARY KEY)

INSERT dbo.PageTest
VALUES (CONVERT(int,0xAAAAAAAA))

Then, grab the page ID and dump its location in memory using DBCC PAGE:

DECLARE @db int = DB_ID()
DECLARE @page int
SELECT TOP (1) @page = allocated_page_page_id
FROM sys.dm_db_database_page_allocations(DB_ID(),OBJECT_ID('dbo.PageTest'),NULL,NULL,'DETAILED')
WHERE is_allocated = 1 
AND is_iam_page = 0

DBCC TRACEON(3604)
DBCC PAGE(@db,1,@page,1)

If I examine that memory location in windbg, I can see my row! In my case, the int value begins at 238d16e8064. Let’s set a breakpoint with “ba r1 <memory address>” and select * from table.
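
For the record, the query is nothing more than a select of the whole one-row table:

SELECT * FROM dbo.PageTest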

Examining the page in memory
Breakpoint on row read

Sure enough, I hit the breakpoint and…oh hey, is that another memory address? I wonder what’s in there after I let the line run (F11).

Hey, that’s my row!

So the row value gets copied to another spot in memory. Maybe that copy is how the row “gets passed” from operator to operator. Let’s prove it!

Run a query with an extra operator. I jump through some hoops to get a separate Filter, because…uh, because I can. Maybe another plan would have been simpler. Whatever.

SELECT *
FROM dbo.PageTest WITH(FORCESCAN)
WHERE ID < 0
OPTION(QUERYTRACEON 9130) --undocumented flag: keeps the predicate out of the scan so it shows up as a separate Filter operator

This time I set a breakpoint on that new memory address the row is copied to. Let’s see if the Filter accesses it.

Success! This memory address is how the row is getting passed!

The last access of that address seems to be when row data is getting prepared for send-off (TDS being the clue here).

Hmm, I wonder what happens if I edit the memory value right after the filter looks at it? Hold my beer.

Editing memory
Time to file a support case

Ok, fine. Breaking SQL Server by editing memory isn’t actually that impressive. But it’s fun! And in the process I’ve come up with some questions and experiments that might be interesting to other (read: normal) people. Those exercises will have to wait for another day though, because windbg kept me up wayyyy too late.

Two Easy ASYNC_NETWORK_IO Demos

There are lots of ways to do dumb stuff with SQL Server. Fun, right? If you’re familiar with ASYNC_NETWORK_IO, then you’re likely aware of the dumbness that causes it.

Otherwise, here’s a quick rundown. Client programs ask for data from the database. If SQL Server has data ready to send, but the client isn’t consuming it fast enough, worker threads stand around tapping their feet, whistling the Jeopardy tune, and producing ASYNC_NETWORK_IO waits.

There are two common reasons for the client to be slower at receiving data than SQL Server is at providing it. The first is a SELECT * kind of situation, where the client is asking for a large (for certain values of large) amount of data. The other is a client that processes data slowly as it’s received, commonly row by row.

Sound like an “application issue”? Yeah, sure is. Which means you might need some extra proof for your well-informed and highly talented developers. Here, have a PowerShell demo.

$qry ="SELECT TOP 10 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')), REPLICATE('a',8000) FROM master.dbo.spt_values"
Invoke-Sqlcmd -server "." -query $qry | ForEach{Start-Sleep -Seconds 1; $_}

Oh, and in case you don’t have your own, here’s a simple script to grab waits in 20s. Make sure to start it before the PS script.

DROP TABLE IF EXISTS #waits

SELECT *
INTO #waits
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('LOGMGR_QUEUE','CHECKPOINT_QUEUE','LAZYWRITER_SLEEP','FT_IFTS_SCHEDULER_IDLE_WAIT','XE_DISPATCHER_WAIT',
'REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','HADR_FILESTREAM_IOMGR_IOCOMPLETION','DIRTY_PAGE_POLL','SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP','BROKER_TO_FLUSH','SLEEP_TASK','SP_SERVER_DIAGNOSTICS_SLEEP','BROKER_TASK_STOP','HADR_WORK_QUEUE','HADR_LOGCAPTURE_WAIT',
'BROKER_TRANSMITTER','HADR_CLUSAPI_CALL','QDS_ASYNC_QUEUE','REDO_THREAD_PENDING_WORK','HADR_TIMER_TASK')

WAITFOR DELAY '00:00:20'

SELECT ws.wait_type, ws.wait_time_ms-w.wait_time_ms AS wait_ms, ws.waiting_tasks_count-w.waiting_tasks_count AS wait_count
FROM sys.dm_os_wait_stats ws
JOIN #waits w ON ws.wait_type = w.wait_type
WHERE ws.waiting_tasks_count > w.waiting_tasks_count
ORDER BY wait_ms DESC
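
If you’re on 2016 or later, a narrower alternative is to watch the waits for just the demo session; the session_id below is a placeholder for whatever your PowerShell connection happens to get.

--2016+ only: waits for a single session (replace 53 with the demo connection's session_id)
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE session_id = 53
AND wait_type = 'ASYNC_NETWORK_IO'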

Maybe you want something a little closer to the application code that’s actually causing the problem. Guess what, you can call .NET objects in PowerShell!

$conn = New-Object -TypeName System.Data.SqlClient.SqlConnection("Server = .;Integrated Security = True")
$qry = "SELECT TOP 1000 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')), REPLICATE('abcd',2000), REPLICATE('FROM',2000), REPLICATE('TSQL',2000) FROM master.dbo.spt_values"
$cmd = New-Object System.Data.SqlClient.SqlCommand

$conn.Open()

$cmd.CommandText = $qry
$cmd.Connection = $conn

$reader = $cmd.ExecuteReader()

while ($reader.Read()) {$reader[0];Start-Sleep -Milliseconds 5}

$conn.Close()

You probably noticed that my queries include large extra columns. If you run the script without them…

$qry ="SELECT TOP 10 ROW_NUMBER() OVER(ORDER BY (SELECT 'joe')) FROM master.dbo.spt_values"
Invoke-Sqlcmd -server "." -query $qry | ForEach{Start-Sleep -Seconds 1; $_}

…the ASYNC_NETWORK_IO waits disappear, even though the row-by-row processing remains! There’s a reason for this: SQL Server doesn’t actually send results row-at-a-time, but stuffs as many as it can into a packet before shipping it off. Without the fluff, all 10 integers arrive at once, and the script no longer leaves the database waiting.
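
That packet is the TDS buffer, sized by the instance’s network packet size setting (4,096 bytes by default, unless the client asks for something else). If you’re curious, you can check the configured value:

--The buffer SQL Server fills before shipping results to the client
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'network packet size (B)'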

Now for a conversation with those skilled and deeply knowledgeable developers…

Slow Reports for Query Store

My first “Holy crap!” moment with Query Store was when I enabled it in production and the dang thing promptly cleared the plan cache (as designed, apparently). My second was seeing the next day’s CPU utilization nearly doubled due to spinlock contention (since fixed in a CU). Talk about side effects. In light of this, do you think I have Query Store enabled in production? Yes, yes I do, because it’s just so useful.

The Problem

My other big complaint is that the built-in reports often don’t work. You know, the ones double-clickable in SSMS, like “Top Resource Consuming Queries.” All I get is a “Waiting…” that runs forever.

I once let it run for a couple hours with no luck. And killing the query somehow cleared the plan cache too! Holy crap #3. Thankfully I have an AG, and my good-enough workaround was to run the reports on a secondary while I pursued other problems.

Then, one day I had to check Query Store in a lower environment, and behold, I had the same slowness issue, but the report actually finished! Pent-up annoyance boiled into action, and I pulled the query, ran it, snagged the actual plan, and caught the surprise culprit.

In all the slow variations I found, the pattern was the same. Repeated access to a TVF packaged with the sys.query_store DMVs would consume huge amounts of CPU, most commonly alongside a Row Count Spool. It didn’t even matter if the TVF returned zero rows – what mattered was the sheer number of accesses, which averaged around 60ms each on my production system. Multiply that across hundreds of thousands of plans/queries being checked, and you have a recipe for slow.

The solution will be to set up plan guides that steer the SSMS queries away from repeated reads of the TVFs. But why stop here? [Actually, feel free to stop here. There’s nothing but gratuitous nerdery until the fix-it scripts.] I have PerfView, and I know how to use it!

Fruitless Investigation

Well that’s interesting – pretty much all CPU went into CopyOutAllPlans. But why stop here? I have windbg, and I pretend to know how to use it!

I learned a new trick in windbg, and discovered that AdvanceToNextPlan was getting called hundreds of thousands of times per TVF execution – again, regardless of whether any rows were returned. My current hypothesis is that this TVF has to scan the entire memory allocation of its object, each and every time. Unfortunately, I wasn’t able to figure out which memory object it falls under, as FREESYSTEMCACHE(‘ALL’) did nothing to the in-memory QS data. I’m trying to improve my windbg skills enough to figure out memory information from debugging, but that may be a ways off.

Solution

It was a little tough getting a plan guide to work, due to the requirement that the query text match character for character, but my solution is to pull the exact text from the DMVs while the report is running. So the fix is to:
1) Open up Top Resource Consuming Queries
2) Run the below script while the report is still spinning
3) Cancel the report and run it again.
The process can be repeated for the reports you actually look at – just remember to use a new plan guide name each time!

--USE YourDB
DECLARE @badqry NVARCHAR(MAX)

SELECT TOP (1) @badqry = SUBSTRING(st.text, (r.statement_start_offset/2) + 1,  
    ((CASE statement_end_offset   
        WHEN -1 THEN DATALENGTH(st.text)  
        ELSE r.statement_end_offset 
	END - r.statement_start_offset)/2) + 1)
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE st.text LIKE '(@results_row_count int,@interval_start_time%'

IF @badqry IS NULL
BEGIN
	RAISERROR('Missed the plan',16,1)
	RETURN
END

EXEC sp_create_plan_guide
@name = N'Fixing QueryStore Top Duration',
@stmt = @badqry,
@type = N'SQL',
@module_or_batch = NULL,
@params = N'@results_row_count int,@interval_start_time datetimeoffset(7),@interval_end_time datetimeoffset(7)',
@hints = N'OPTION (HASH JOIN, LOOP JOIN)'

I’m certain there’s a more efficient combination of hints than (HASH JOIN, LOOP JOIN) to prevent the problem, but these have worked well enough for me so far. Please let me know if you find something better.
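
And if you later need to verify the guide took effect, or back it out, sys.plan_guides and sp_control_plan_guide cover both:

--Is the guide there and enabled?
SELECT name, is_disabled, scope_type_desc
FROM sys.plan_guides

--Back it out if it misbehaves
EXEC sp_control_plan_guide @operation = N'DROP', @name = N'Fixing QueryStore Top Duration'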

There aren’t many folks using Query Store, or even 2016 yet, so I guess I’m paying the early-adopter tax (you’re welcome). But for all the frustrating issues I’ve encountered, I still see a genuinely useful product mixed in. And the “Holy crap!” moments keep pushing me to deepen my skills and write mildly entertaining blog posts wherein I pretend to use windbg.

The Three Physical Joins, Visualized

Less than a month into my first SQL job, I was handed a report that “suddenly stopped working.” I followed the Excel sheet to a data connection to a stored procedure to a SELECT to a view on a view on a view.

The difference between the query that worked and the one that never finished (at least, over the 10 hours I let it run) was a Hash versus a Loop join. It was my first introduction to query plans and to the importance of physical joins.

But what are they? Distinct from logical joins (INNER, OUTER, CROSS, etc.), physical joins are the different algorithms available to actually compare data from different tables. You don’t have to specify which one SQL Server will use, and usually SQL Server gets it right. But there’s a large gap between usually and always, and the fine art of query tuning lives in that gap. Having a good understanding of the joins is essential for mastering tuning (if mastery is even possible), so let’s take a look.

Nested Loops

Overview: The simplest algorithm is Nested Loops. Imagine you have two piles [tables] of values, and you want to find the matches. Take the first value from the first pile, and compare it to each value in the second pile. Then grab the next value from the first pile, and compare to everything in the second. Grab the third value…wash, rinse, repeat. If you write the pseudo-code for the algorithm, it looks something like below.

For Each value in pile1
    For Each value in pile2
        If pile1.value = pile2.value
            Return pile1.value, pile2.value

Hey, it’s one For-Each loop nested inside another…you might even say, Nested Loops. Ahem. Anyways, here’s a visualization.

Drawbacks: An important detail of Nested Loops to notice is that values in the bottom (inner) input get read multiple times, which can quickly grow out of control for large tables. This isn’t an issue if an index on that inner table allows seeking to the exact match, but otherwise the amount of work to compare rows increases drastically with size.

Merge Join

Overview: Merge Join allows comparison while reading each input value only once. The catch is that each input has to be sorted (on the compared values, no less). Roughly: start at the beginning of each input and progress through the values, advancing whichever input has the lower current value and outputting matches along the way. I have pseudo-code for Merge, but it didn’t seem helpful, so I’m skipping straight to the picture.

Drawbacks: Remember how I said that Merge reads each input value only once? I lied. There’s a special (and not uncommon) case where that’s not true. If both inputs have repeated values, called Many-to-Many, the algorithm gets really complicated. As in, holy-crap-why-is-it-taking-so-long complicated. But if you can guarantee unique values in at least one input (like my gif above shows), you’re golden. So, Merge has two big gotchas: the many-to-many case, and the requirement for both inputs to be sorted.

Hash Match

Overview: The final algorithm is Hash Match, also known as Hash Join. Like Merge, it only needs a single pass through each input. It’s also the most complicated. Here’s how it works (with minor lies, er, simplifications): there are two phases, called the build phase and the probe phase. In the build phase, each value from the top input is hashed, and the hash and the value are stored in a bucket. There are multiple buckets, and the hash value determines which bucket the hash and original value land in. Then comes the probe phase. Each value in the second input is hashed, and that hash is compared to the hashes in the buckets. If a hash matches, the stored values are checked to confirm the match. With an inner join, values that match are output, and the rest are discarded.

Below is my animated Hash Match, but first a note on my fake-for-learning-purposes-only hash function. I take each input number, spell it out, and take the first letter. For example, 1 becomes “one” becomes “o.” 5 becomes “five” becomes “f.”

Drawbacks: So why isn’t everything a Hash Join? Well, surprise surprise, it has its own weaknesses. The biggest one to consider is that “storing hashes and values in buckets” piece. Storing occurs in memory (hopefully), and memory is a limited resource. And, if the Hash Match needs more memory than it was given, it doesn’t get more, but writes (spills) to disk instead, killing performance.

I would probably summarize an introduction to the joins as below

Join         | Strengths                            | Commonly Seen                              | Blows Up When
Nested Loops | Simple, fast with the right indexes. | Small joins with proper indexes.           | Inner table has lots of rows repeatedly read.
Merge        | Very fast with sorted inputs.        | Key to key joins. Query includes ORDER BY. | Necessary sorts take too long. Many-to-Many.
Hash         | Generally fast.                      | Heaps. Large joins.                        | Can’t get enough memory.
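
If you want to see each algorithm in an actual plan, the quickest (if heavy-handed) way is a join hint. The tables below are hypothetical stand-ins for any pair you can join on a key:

--Force each physical join on the same query and compare the plans
--(swap LOOP for MERGE or HASH; the results are identical, the algorithm isn't)
SELECT p.ID
FROM dbo.Parent AS p
JOIN dbo.Child AS c ON c.ParentID = p.ID
OPTION (LOOP JOIN)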

The three algorithms each have different strengths, weaknesses, hidden gotchas, and bizarre variants. I’ve fallen in love with the subtleties and complexities that arise from deceptively simple starting points, and it pained me to have to skip over important details (such as Merge and Hash joins requiring an equality comparison). There’s really a lot more to write about, which I intend to do…eventually. In the meantime, if you want to read more, I suggest looking at Craig Freedman’s articles and Hugo Kornelis’ site.

So is NOLOCK Actually Faster?

NOLOCK has a reputation. Several reputations in fact. To the database snob, NOLOCK is a vestigial practice of the untrained masses, a Pavlovian reflex borne [yes, borne] of fossilized ignorance and apathy. Millions of man-hours of coding have gone into building robust ACID databases that guarantee consistency… to be casually thrown away by peons who can’t be bothered to actually fix their design.

For others, NOLOCK is the two-click salvation when the system is down, the company is losing thousands of dollars a minute, and the full weight of the corporate totem pole is pressing down. Incorrect results? Sure, in theory. Never seen one. NOLOCK works, so why not keep using it?

What about another reputation? That NOLOCK is the “go faster” hint? This one can be tested, and I love testing! There are three main ways it might work:
1) because it doesn’t get blocked
2) because it doesn’t have the overhead of taking locks
3) because it can use allocation order scans

Using NOLOCK to get around blocking is…I’m trying to think of a word here…dumb. I’m not talking about running an ad-hoc query on a busy production system where exact results don’t matter. I’m talking about adding the hint to EVERYTHING as a matter of practice. If your queries are slow from blocking, you need to fix how much data they’re reading, often through indexes, or move to RCSI. But others have written more on this.

Regarding allocation order scans, they help the most on severely fragmented tables – the ones with a GUID as clustering key for example. They also seem to help with physical reads. In fact, if you search for “NOLOCK performance testing,” the top result shows using the hint is faster, but the reason in that test is the different scan type. But when your data is in memory and not especially fragmented (you know, the way you’d expect when you’re not being bad at databases), the allocation scan doesn’t help as much.

So what about the second scenario? Ignoring blocking, ignoring allocation scans, does the locking overhead make a difference? Let’s find out!

Here’s my setup – four tables: wide and narrow, with a MAX column and without.

--narrow
CREATE TABLE dbo.narrow (
ID INT PRIMARY KEY,
junk CHAR(1))

--wide table, no MAX
CREATE TABLE dbo.wide (
ID INT PRIMARY KEY,
junk CHAR(7000))

--narrow, MAX
CREATE TABLE dbo.narrow_max (
ID INT PRIMARY KEY,
junk char(1),
bigjunk VARCHAR(MAX))

--wide, MAX
CREATE TABLE dbo.wide_max (
ID INT PRIMARY KEY,
junk CHAR(7000),
bigjunk VARCHAR(MAX))


INSERT dbo.narrow
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY(SELECT 'Obbish')), 'a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

INSERT dbo.wide
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY(SELECT 'Obbish')), 'a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

INSERT dbo.narrow_max
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY(SELECT 'Obbish')), 'a', 'a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

INSERT dbo.wide_max
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY(SELECT 'Obbish')), 'a','a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

I then run the following SELECT statements that scan the tables. The WHERE clause makes sure no rows get returned and sent over the network. I use a warm cache because that’s a bit more realistic for most environments.

SELECT *
FROM dbo.narrow
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.narrow WITH(NOLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.wide
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.wide WITH(NOLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.narrow_max
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.narrow_max WITH(NOLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.wide_max
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

SELECT *
FROM dbo.wide_max WITH(NOLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)

Average scan speeds (in ms) are as follows:

Table Normal NOLOCK
narrow 51 53
wide 794 525
narrow_max 53 52
wide_max 1433 1124

So yes, it looks like NOLOCK is faster, but only on wider tables in a scan situation. Seriously though, if your system is running a lot of scans on really wide tables, you might want to reconsider whether speeding them up is your real priority.

What about an OLTP-style test, with a lot of single-row seeks? I used SQL Query Stresser with 40 threads of 10s WHILE loops and the below procs.

CREATE OR ALTER PROC dbo.seek_narrow
AS
BEGIN
	SELECT ID
	FROM dbo.narrow
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

CREATE OR ALTER PROC dbo.seek_narrow_NOLOCK
AS
BEGIN
	SELECT ID
	FROM dbo.narrow WITH (NOLOCK)
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO


CREATE OR ALTER PROC dbo.seek_wide
AS
BEGIN
	SELECT ID
	FROM dbo.wide
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

CREATE OR ALTER PROC dbo.seek_wide_NOLOCK
AS
BEGIN
	SELECT ID
	FROM dbo.wide WITH (NOLOCK)
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO


CREATE OR ALTER PROC dbo.seek_narrow_max
AS
BEGIN
	SELECT ID
	FROM dbo.narrow_max
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

CREATE OR ALTER PROC dbo.seek_narrow_max_NOLOCK
AS
BEGIN
	SELECT ID
	FROM dbo.narrow_max WITH (NOLOCK)
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO


CREATE OR ALTER PROC dbo.seek_wide_max
AS
BEGIN
	SELECT ID
	FROM dbo.wide_max
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

CREATE OR ALTER PROC dbo.seek_wide_max_NOLOCK
AS
BEGIN
	SELECT ID
	FROM dbo.wide_max WITH (NOLOCK)
	WHERE ID = CONVERT(INT,RAND()*1000000)
END
GO

And here’s what I used to grab execution stats:

SELECT OBJECT_NAME(object_id) AS [procedure],execution_count,total_worker_time
FROM sys.dm_exec_procedure_stats
WHERE object_id IN (OBJECT_ID('dbo.seek_narrow'),OBJECT_ID('dbo.seek_narrow_NOLOCK'),OBJECT_ID('dbo.seek_wide'),
OBJECT_ID('dbo.seek_wide_NOLOCK'),OBJECT_ID('dbo.seek_narrow_max'),OBJECT_ID('dbo.seek_narrow_max_NOLOCK'),
OBJECT_ID('dbo.seek_wide_max'),OBJECT_ID('dbo.seek_wide_max_NOLOCK'))
ORDER BY [procedure]

And here are my results, measured in executions, averaged over three runs.

Table Normal NOLOCK
narrow 11,059,073 10,530,684
wide 10,770,979 10,838,683
narrow_max 10,821,620 10,533,960
wide_max 10,443,677 10,035,070

The runs had enough variance to them that I’m not confident in saying a particular method is a clear loser, but it should be obvious that NOLOCK is no help here.

And overall I think NOLOCK is a loser. Sure, it’s faster in scans on wide tables, but maybe you should fix the scans and the wide tables, instead of slapping on a bandaid. You know, you could use a TABLOCK hint for the same effect, without risking correctness.
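
For completeness, here’s roughly what that looks like with the earlier scan test. With TABLOCK the scan takes a single shared table lock instead of a stream of page locks, so the locking overhead drops without giving up read consistency – the tradeoff being that writers are blocked while it runs.

--Same scan test as above, TABLOCK instead of NOLOCK
SELECT *
FROM dbo.wide WITH (TABLOCK)
WHERE junk = 'b'
ORDER BY ID
OPTION(MAXDOP 1)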

There’s more I could test, and probably will. It turns out databases are complicated, or something. NOLOCK performance has long interested me – my only (as of this moment) dba.stackexchange question is about a case where NOLOCK is slower. In the end, though, I admit that I’m a database snob. Stop using NOLOCK, and fix your design!

Gotta Cache ‘Em All: Query Store Survey

I have a friend who’s really into Pokemon. Really into Pokemon. I had no idea there was such a thing as a “shiny” pokemon until he, uh, graciously spent a whole evening lecturing me on their value and capture.

It’s a passion I don’t really get, but have to tolerate, maybe even respect. It turns out I’m the same, just with different monsters. You see, I really like plan operators. I like knowing their behavior, I like learning their natural habitat, and I love hunting for exotics. If I could find a way to stuff and mount Flow Distinct on my wall, I probably would.

There are a handful of ways to see what operators roam your production environment. If you’re lucky enough to be working with 2016 or later (I’ve seen the stats though, condolences for those of you still on 2005), you can peek into Query Store. Actually, with how long the query takes to run, it might be more of an expedition than a peek. Caveat executor.

Here’s the script – again, mind that this can be an expensive query.

DROP TABLE IF EXISTS #QS_xml
GO

CREATE TABLE #QS_xml (
query_plan XML
)

INSERT #QS_xml
SELECT TRY_CONVERT(XML,query_plan) query_plan
FROM sys.query_store_plan qsp
GO


WITH XMLNAMESPACES(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan') 

,CTE AS (
SELECT x.value('./@PhysicalOp','varchar(200)') AS PhysicalOp, x.value('./@LogicalOp','varchar(200)') AS LogicalOp
FROM #QS_xml qsx
CROSS APPLY qsx.query_plan.nodes('//RelOp') c(x)
)

SELECT CTE.PhysicalOp, CTE.LogicalOp, COUNT(*) AS [Count]
FROM CTE
GROUP BY PhysicalOp, CTE.LogicalOp
ORDER BY [Count] DESC

The most common operator for me, at a whopping quarter-million occurrences, is the Clustered Index Seek. That’s probably a good thing, since I do OLTP. I still see plenty of scans, because developers, and a decent range of not-rare-but-not-common operators, like the Table Spool (hmm, I should probably look into that query…).

Then there’s the esoteric bottom of the list, where operators show up in only a couple of places instead of a quarter-million. One of these was new to me: Inserted Scan, which comes from the Inserted pseudo-table in a trigger. I also have a couple of oddities like a Parameter Table Scan and a Hash Match implementing a Full Outer Join.

Of course, hiding down there are three shiny Flow Distincts. Time for me to hunt down a repro!

Cascading Deletes – Cardinality

Who knew cascading deletes could be so horribly entertaining (emphasis on horrible)? The problems I wrote about traced back to cardinality, but I wasn’t ready to write about it. So, uh, I might have spent most of a perfectly good weekend reading Dmitry and Microsoft patents.

The scan plan was coming from a really bad cardinality assumption: given a Parent ID sufficiently outside the stats histogram, *every* Child table row would join to it. But why was I getting such a ridiculous estimate?

Time for an undocumented trace flag (gosh, I love these)! Dmitry’s post introduced me to trace flag 2363, which outputs cardinality estimation info for the new estimator. I tossed in a QUERYTRACEON, and got the following.
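
For reference, the run looked roughly like this against the Parent/Child repro – TF 2363 prints the new CE’s estimation details, and TF 3604 routes that output to the messages tab. Undocumented flags, so test systems only.

--Undocumented diagnostics: 2363 = new-CE estimation details, 3604 = send trace output to the client
DELETE dbo.Parent
WHERE ID = -1
OPTION(QUERYTRACEON 2363, QUERYTRACEON 3604)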

Coooooooool! The optimizer considers stats on Parent.ID to be a badly-formed histogram, whatever that is. Moreover, it just gives up, moves to CSelCalcGuessComparisonJoin, and gives selectivity of 1 (aka everything). Nice guess.

What happens when the ID is within range?

Some huge amount of testing later, I’m able to determine that the bad bailout only occurs when 1) Parent stats are “badly-formed” and 2) the ID is out of range. Stats updated with a FULLSCAN aren’t badly-formed, which seems to be why the problem disappears in that circumstance.

Also interesting is what happens in 2014 compatibility with QO Fixes off.

It ejects, but uses CSelCalcSimpleJoin, with a much more reasonable guess.

I kind of have my answer at this point, which is that the reason for crappy cardinality is buried deep within the proprietary bowels of SQL Server’s estimator design. Luckily there was a holiday sale on unsatisfying answers, and I got a new question for free.

What the heck is a “badly-formed histogram?”

Spelunking with windbg left me in the dark, so I resorted to empirical testing with stats donating. As best I can tell, a badly-formed histogram only occurs for a column that is unique and has decimal-valued range_rows. In turn, decimal-valued range_rows only seem to occur for large-enough tables with sampled statistics. How’s that for a summary of weeks of investigation?
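
If you want to hunt for the pattern in your own database, here’s a sketch based on that working definition (emphatically not an official one): histogram steps with fractional range_rows on the table in question.

--Working definition only: look for fractional range_rows in a table's histograms
SELECT s.name AS stats_name, sh.step_number, sh.range_high_key, sh.range_rows
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) sh
WHERE s.object_id = OBJECT_ID('dbo.Parent')
AND sh.range_rows <> ROUND(sh.range_rows, 0)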

To be honest, I’m not really happy with the state of my understanding here. My answers are more anecdote than explanation, only looking at a lower layer of turtle. Maybe if I booze up a Microsoft employee enough I can get them to share info outside of NDA…

Anyways, I’m off to read more patents.

Cascading Delete Troubles

I had an interesting conversation with a developer the other day – it went something like this:

Developer: “Why does it take three minutes to delete a single row from this table?”

Me: “Hold on while I look at the plan. Oh, it’s because SQL Server is checking the 460 foreign keys that reference that table, 447 of which don’t have a supporting index.”

I know the pains of foreign keys without indexes. Oh, do I know. So when I saw a cascading single-row delete plan with a scan on the child table, I thought, “Aha! I know how to fix this, an index!”

But the index was already there, a perfect one for supporting that foreign key with the abominable ON DELETE CASCADE. Something was obviously wrong, and I immediately suspected our recent switch to 2016 compatibility level – meaning a new cardinality estimator.

Sure enough, in 2012 mode the query optimizer produced a seek.

I was able to come up with a repro of the issue (which is what the above plans are from).

--set up two tables, one with a foreign key with cascading delete
DROP TABLE IF EXISTS dbo.Child
DROP TABLE IF EXISTS dbo.Parent
GO

CREATE TABLE dbo.Parent (
ID INT IDENTITY PRIMARY KEY CLUSTERED,
junk CHAR(1)
)

INSERT dbo.Parent
SELECT TOP 1000000 'a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

CREATE TABLE dbo.Child (
ID INT IDENTITY PRIMARY KEY CLUSTERED,
morejunk CHAR(5),
ParentID INT
)

ALTER TABLE dbo.Child
ADD CONSTRAINT [💩] FOREIGN KEY (ParentID)
REFERENCES dbo.Parent (ID) 
ON DELETE CASCADE
ON UPDATE CASCADE

CREATE INDEX IX_agggghhhhhh ON dbo.Child(ParentID)

INSERT dbo.Child
SELECT TOP 1100000 'bah!', 42 --repro also works with random numbers for FK column 
FROM master..spt_values a
CROSS JOIN master..spt_values b
GO


-----------------------------------------------------------------
--Just need estimated plan here. Does it scan on Child?
DELETE dbo.Parent
WHERE ID = -1  --any value here too low or too high will cause a scan, often including values that do exist in Parent!

The short of it is, with any value too far outside of the histogram of the parent table, the new cardinality estimator assumes that EVERY SINGLE ROW of the child table will get deleted.

This can be seen even when a seek is forced through a USE PLAN hint.

Interestingly enough, running the new cardinality estimator in 2014 compatibility level with Query Optimizer Fixes off will also produce a seek plan. I found a few other ways to get the good plan, and stuck them in a bonus script.

DELETE dbo.Parent
WHERE ID = -1
OPTION(
USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION')
)

DELETE dbo.Parent
WHERE ID = -1
OPTION(
USE HINT('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
)
--legacy CE with base containment (TF2301) still seeks though


--Repro is dependent on statistics
SELECT * FROM sys.dm_db_stats_properties_internal(OBJECT_ID('dbo.Parent'),1)
SELECT * FROM sys.dm_db_stats_histogram(OBJECT_ID('dbo.Parent'),1)
SELECT * FROM sys.dm_db_stats_histogram(OBJECT_ID('dbo.Child'),2)

UPDATE STATISTICS dbo.Parent WITH SAMPLE 5 PERCENT  --allows problem
UPDATE STATISTICS dbo.Parent WITH FULLSCAN  --changes histogram and prevents problem
UPDATE STATISTICS dbo.Parent WITH SAMPLE 99 PERCENT  --still allows problem

There’s something I don’t understand happening with the cardinality estimator here (as is usual). It seems when the value being deleted from the Parent table is within its histogram, the optimizer uses stats from the Child. Otherwise, it assumes the entire Child table matches. Ouch.

Remember how my demo only has two tables? The real scenario I encountered was a cascading delete hitting forty tables, and, you guessed it, scanning each one. Block party!

Forcing plans with the help of Query Store only goes so far, especially when it’s playing whack-a-mole with ORM-generated SQL. The best solution for now might be to retreat to 2012 compatibility and complain on the internet. Bah!

Debugging for Trace Flags

Good thing I have a test machine, because I just crashed SQL Server again. Prepare yourselves for (yet another) post about doing horrible things to the database for fun and profit (er, more fun).

Obligatory “blah blah serious warning blah blah DON’T DO THIS IN PRODUCTION blah blah or you’ll go blind blah blah.” Ok, let’s get started! I have a documented obsession with undocumented trace flags and have long wanted to use the debugger to discover new ones. I just didn’t comprehend what I saw in windbg until I learned about registers and how a program actually runs. Thanks, Crash Course Computing!

Armed with an incomplete understanding of fundamentals and a googled list of windbg commands, I hypothesized that if I turned on some trace flags for a query, they would eventually be passed through one of the registers. From there, I could work out the mechanisms involved and maybe even how to find new flags!

Running a query with a ridiculously long compile time let me capture optimization call stacks with PerfView. The common root of those seemed to be sqllang!CSQLSource::Transform, so I set a breakpoint on it. F11 let me step into the various classes [I don’t even know if I’m using the right terms here] called, and I eventually hit a lucky break.

Behold, sqldk!CSessionTraceFlags::CheckSessionTraceInternal

Clues just don’t get more obvious than that.

Setting a breakpoint on that function [module? class?] revealed that it gets called a lot. I kept F11-ing through the function while watching the registers. I had turned on several trace flags and was looking for their hex forms, but didn’t see them. Eventually, I ran across a call to CheckSessionTraceInternal originating from sqllang!CExecLvlIntCtxt::FDontExpandPersistedCC. “Aha!” I thought (and might have said out loud) “I know what trace flag prevents expanding persisted computed columns.” A glance at my favorite list of trace flags confirmed what I remembered; I was looking at TF 176. The Windows calculator has a handy hex converter in its “Programmer” mode, and I quickly determined I would be looking for “b0” in a register.
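
(If you’d rather not leave SSMS, T-SQL will do the conversion too.)

SELECT CONVERT(varbinary(4), 176) --0x000000B0, aka b0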

Bingo.

The pattern soon became obvious. CheckSessionTraceInternal is called whenever a function checks whether a trace flag is turned on. Luckily, when CheckSessionTraceInternal is called, the value of the trace flag it’s checking is passed through register rdx (I think technically edx, but don’t ask me the difference yet) and then stored in r10.

I spent a dozen-plus hours hitting breakpoints on CheckSessionTraceInternal, and made a long list of interesting flags along with the function [method?] checking the flag. I even saw my new friend 8672.

Well, that was the long rambling version of how I discovered this. TLDR is that a breakpoint on sqldk!CSessionTraceFlags ::CheckSessionTraceInternal exposes the trace flag being checked in a register. And of course, Paul White found it first.