Reasoning about Indexed View Locking

Edit: Paul has pointed out some significant gaps, so I will be reworking this post in the future. Locks are accurate for this particular scenario, but apparently there are other behaviors to consider.

Sometimes I hit a weird query where I wonder if, just maybe, an indexed view would help. Then a tingling sensation crawls down my spine and I run screaming in terror, because indexed views have a reputation. Ghastly deadlocks, phantom issues, and sinister edge cases haunt my DBA sensibilities, scaring me away.

But I prefer understanding – I prefer science. So I investigated and experimented, poked and prodded. Great thanks to Paul White and Erik Darling; Erik, somewhere along the way, spelled out the key idea: one/many-to-many joins in the view cause serializable locks.

My setup script (full script at the bottom) creates two identical tables A and B of even numbers 2 to 10, with an indexed view AB that joins them on A.ID > B.ID in a one-to-many relationship.

CREATE OR ALTER VIEW dbo.AB
WITH SCHEMABINDING
AS
SELECT A.ID AS AID, COUNT_BIG(*) AS bigcount
FROM dbo.A
JOIN dbo.B
	ON B.ID < A.ID
GROUP BY A.ID

A modification to a table here involves three objects, each with associated locks: the modified table, the joined table, and the view. First, the locks for the original modification are taken; then locks on the other table, as the engine figures out which rows of the view are affected; and finally locks on the rows in the view.

But that was hard to keep all in my head, so I drew it out. Here are the objects (keys only).

A is the “one” side of the one-to-many, so a modification there involves “normal” locks. But a modification to B, since it can be referenced by multiple rows of A, will involve the super special magical serializable locks.

Let’s look at an insert:

The row inserted into B gets an X lock. Next, the engine needs to search through A to find the matching rows of AB, meaning transient S locks in A.

The S locks are released after rows are read, but when the engine arrives at AB, we see range locks (and please note the order of S and Range locks can be interleaved, but I separate them for simplification).

But why range locks? Because the many-ness of the relationship means there are potentially conflicting inserts that wouldn’t be covered by just key locks.
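
If you want to see the locks for yourself, here's a quick sketch using the setup script at the bottom (the inserted value 7 is arbitrary – any key not already in B works). Holding the transaction open lets you inspect what's still held in sys.dm_tran_locks:

BEGIN TRAN

INSERT dbo.B VALUES (7, 'data')

SELECT resource_type, request_mode, request_status,
	resource_description, resource_associated_entity_id
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID
AND resource_type = 'KEY'

ROLLBACK

The transient S locks on A will already be gone by the time you look, but the X lock on B and the range locks on AB stick around until the transaction ends.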

Yikes, no wonder these produce deadlocks! Also, the engine is taking more range locks on the view than it strictly needs to – probably to prevent complicated programming around edge cases.

Maybe it’s just how my brain works, but drawing these out is what finally made the locking strategy click for me. I don’t consider indexed view locking as mysterious anymore (yay science!), but I do find it monstrous. Be warned!

--Setup
CREATE TABLE A (ID INT PRIMARY KEY, junk CHAR(10))
CREATE TABLE B (ID INT PRIMARY KEY, junk CHAR(10))

INSERT A
SELECT TOP (5) 
(ROW_NUMBER() OVER(ORDER BY 1/0))*2,
'data'
FROM sys.columns

INSERT B
SELECT TOP (5) 
(ROW_NUMBER() OVER(ORDER BY 1/0))*2,
'data'
FROM sys.columns
GO

CREATE OR ALTER VIEW dbo.AB
WITH SCHEMABINDING
AS
SELECT A.ID AS AID, COUNT_BIG(*) AS bigcount
FROM dbo.A
JOIN dbo.B
	ON B.ID < A.ID
GROUP BY A.ID
GO

CREATE UNIQUE CLUSTERED INDEX CX_AIDs ON dbo.AB (AID)
GO

Reverse Engineering the Key Lookup Cost Formula

During a recent investigation into Key Lookup costing, I discovered something interesting. Perhaps a warning is due: I’m a nerd, and things I find “interesting” may cure insomnia in others. But, dear reader, if you’re taking time to read a database blog, you just might be a nerd as well.

Examining a single key lookup, it appears to have a cost of 0.0001581 CPU and 0.003125 IO. However, running the math over several experiments showed that the per-lookup cost decreases as the number of lookups grows. But how much? What’s the formula?

Gathering Data

I didn’t know if I would need a complicated curve fitting approach, but I did know the more data points the better. I also knew that working through hundreds of queries and manually recording query buck costs would suck and I didn’t want to do it. But how could I automate pulling costs from execution plans of hundreds of queries? Oh. Right. It’s PowerShell time isn’t it?

Step 1 of course had to be SQL setup. I used a thousand-page table with one row per page.

CREATE TABLE Testable (
ID1 INT IDENTITY PRIMARY KEY,
ID2 INT,
filler CHAR(8000),
INDEX IX_testable_ptooie (ID2)
)

INSERT Testable
SELECT TOP(1000)
ROW_NUMBER() OVER(ORDER BY 1/0),
'junk'
FROM sys.columns

Step 2 was getting the estimated execution plan using PowerShell. Step 3 was getting the PowerShell to actually work. Step 4 was using XPath inside a PowerShell script, so you can guess what steps 5-20 were: debugging. Blech…XML schemas. Special thanks to Rob Farley and Jonathan Kehayias for sharing useful code.

function Get-SqlPlan([string] $query){
return ([xml](invoke-sqlcmd -MaxCharLength 9999999 -Server "." -Database "TestDB" -Query ";set showplan_xml on;`ngo`n $query;`ngo").Item(0));
}

$results = @{}

for ($i = 1; $i -le 1000; $i++) {

$query = "SELECT *
FROM Testable WITH(INDEX(IX_testable_ptooie))
WHERE ID2 <= $i
OPTION(MAXDOP 1)"

$t = Get-SqlPlan($query)
$ns = new-object 'System.Xml.XmlNamespaceManager' $t.NameTable;		
$ns.AddNamespace("ns", "http://schemas.microsoft.com/sqlserver/2004/07/showplan");
$results.add($i,$t.SelectNodes('//ns:RelOp[@LogicalOp="Clustered Index Seek"]',$ns).EstimatedTotalSubtreeCost)
}

$results.GetEnumerator()| export-csv -NoTypeInformation -Path "C:\Temp\Key Test.csv" 

Reverse Engineering

I was left with a CSV full of costing info for a thousand key lookups.

Time for some Excel analysis. (Look Ma, I’m a Data Scientist!) First, I calculated the additional cost of each key lookup. Quick tests confirmed that the additional CPU cost stays constant per lookup; only the IO cost decreases. This makes sense, because as key lookups continue, there’s a chance that the page needed was already pulled into the buffer pool.

Yup, weird results for the second through fourth lookups

The decrease obviously wasn’t linear, so I decided to try an approach from statistics. There’s a class of related problems including buckets and balls, and matching birthdays. In the case of a thousand-page table, the second lookup should occur on a random page with a 999/1000 chance to be different from the first lookup’s page. The third lookup should have 999/1000 chance to be different from the first, and 999/1000 chance to be different from the second, so (999/1000)^2. Generalizing, I’d expect the Nth lookup to have a (999/1000)^(N-1) chance of needing a new page (and thus an IO).

Let’s plug in the formula and see…

Almost…but for some reason, the empirical exponent turned out to be N-2 instead of N-1. I can’t come up with why, but whenever I see weird exponents in statistics the explanation is something something *handwaving* degrees of freedom.

There are still two other quirks. The first is that lookups 2, 3, and 4 don’t follow the pattern – not even close. I have no idea why this is the case. The other quirk shows up when looking at formula versus empirical lookup costs for high lookups.

Marginal IO costs only approximately follow the formula here, with a lot of repeated values. My guess is that doing a heap of math isn’t very efficient when SQL Server is trying to quickly determine a plan, so it uses saved values, approximations, or other tricks.

Math

Anyways, here’s what I learned about Key Lookup costs: the CPU component is always 0.0001581 per lookup. IO costs are capped at 0.003125 times the number of pages in the table. The extra IO cost of the Nth lookup is approximately (see prior caveats) ((PageCount-1)/PageCount)^(N-2). Or, using my fancy math font program:

fancy program ran out of room to write the part about max page costs…
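
To make the approximation concrete, here's a small T-SQL sketch of the same math. The constants and the N-2 exponent are my reverse-engineered guesses (see the caveats above), not anything official:

DECLARE @PageCount FLOAT = 1000	--pages in the table
DECLARE @N INT = 500			--which lookup we're costing

SELECT 0.0001581 AS marginal_cpu_cost,
	0.003125 * POWER((@PageCount - 1) / @PageCount, @N - 2) AS approx_marginal_io_cost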

So there you have it, something about SQL Server with no practical value. Can you believe I spent vacation time doing this? Actually, maybe I can troll some MS folks by saying they got their estimation math wrong. I do believe the people who wrote the estimation (and who are much smarter than I am) had some really good reasons, but hey, why let that get in the way of inflammatory tweeting?

Parse Time vs Compilation Time

I’ve been messing around with long compilations recently, and a question was lingering in the back of my mind, based off of SET STATISTICS TIME ON…

Is there a difference between parse time and compile time? This led me to science time, aka beating up SQL Server and taking its secrets.

First, let’s build a proc that contains a very simple query but a gigabyte of whitespace. Errr, whoops…

Fine, just half a gig of whitespace:

DECLARE @w VARCHAR(MAX) = '        ' --8 spaces

SET @w += @w --2^4
SET @w += @w --2^5
SET @w += @w --2^6
SET @w += @w --2^7
SET @w += @w --2^8
SET @w += @w --2^9
SET @w += @w --2^10 = 1024, 1KB
SET @w += @w --2^11
SET @w += @w --2^12
SET @w += @w --2^13
SET @w += @w --2^14
SET @w += @w --2^15
SET @w += @w --2^16
SET @w += @w --2^17
SET @w += @w --2^18
SET @w += @w --2^19
SET @w += @w --2^20 = 1MB
SET @w += @w --2^21
SET @w += @w --2^22
SET @w += @w --2^23
SET @w += @w --2^24 
SET @w += @w --2^25
SET @w += @w --2^26
SET @w += @w --2^27
SET @w += @w --2^28 
SET @w += @w --2^29 = 512MB
--SET @w += @w --2^30 = 1GB


DECLARE @sql VARCHAR(MAX) = '
CREATE OR ALTER PROC Parsimonious
AS
SELECT TOP(1)'+@w+' *
FROM dbo.Comments
OPTION(RECOMPILE)
'

EXEC(@sql)

This is enough to prove parse time <> compile time, as they show up with different numbers. STATISTICS TIME captures the full duration.

While the execution plan shows compile time as less. (Also, the Extended Events that show compile time matched the execution plan time.)

sys.query_store_query contains columns related to this, with more than just the execution plan compilation number, but those don’t show the full time either.
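
For reference, here's a minimal sketch of how to pull those columns (assuming Query Store is enabled on the database – the duration columns are in microseconds):

SELECT q.query_id,
	q.count_compiles,
	q.avg_compile_duration,
	q.avg_bind_duration,
	q.avg_optimize_duration
FROM sys.query_store_query q
WHERE q.object_id = OBJECT_ID('dbo.Parsimonious')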

In summary, parsing is different from compilation, and consumes CPU that doesn’t show up anywhere except STATISTICS TIME. This would make a server spending all its compute on parsing harder to detect, though I imagine it’s rare. I can’t say it never happens though, because I’ve seen things in production. Seen things.

How to Measure Compilation Impact

Across the many instances I’ve seen, it’s a rather rare scenario where sustained compiles/recompiles contribute substantially to CPU load. The path from clues to conviction was a long one, so buckle up if you want to follow along – this is a lengthier-than-average post.

First Signs of a Compilation Problem

I’ve found the real-world issue twice, and it’s been two different sources of information that convinced me to look at compilations. The easiest, of course, is the compilations counter. Most monitoring tools will track it for you, but in case you want to check for yourself, here’s a sampling script:

DECLARE @comp TABLE (counter_name varchar(100), cntr_value bigint)

INSERT @comp
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE '%compila%'

WAITFOR DELAY '00:00:30'

SELECT pc.counter_name, (pc.cntr_value-c.cntr_value)/30 AS val_per_sec
FROM sys.dm_os_performance_counters pc
JOIN @comp c ON c.counter_name = pc.counter_name
WHERE pc.counter_name LIKE '%compila%'

Another type of clue actually comes from absence. My experience here is from an Azure SQL DB with high CPU but only moderate compilations. However, the top CPU consuming queries were only accounting for a fraction of observed compute. Let me give an example – suppose you have a 4 vCore SQL DB running at around 80% CPU utilization. You’d expect that to be 80% * (available core seconds) = .8*4*3600 = 11520 seconds of CPU time in the last hour. You can check Query Store to get a sense of how much execution compute there is.

You can also query the query store tables directly to find total CPU, but this is left as an exercise for the reader.
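
(If you'd rather skip the exercise, a rough sketch looks something like the below – avg_cpu_time is stored in microseconds, and this ignores intervals that straddle the one-hour boundary.)

SELECT SUM(rs.avg_cpu_time * rs.count_executions) / 1000000.0 AS total_cpu_seconds
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_runtime_stats_interval rsi
	ON rsi.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE rsi.start_time >= DATEADD(HOUR, -1, SYSUTCDATETIME())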

Now, the CPU gap could be coming from several sources missed in a cursory analysis, such as single-use plans that Query Store fails to aggregate. But if you’re only finding a fraction of the CPU you expect, there’s something worth investigating.

Of course, I would be remiss in ignoring the plan cache as corroborating evidence. Huge numbers of single-use plans (or duplicate query hashes) are the thing to look for, but it’s not definitive. Some joker may have decided the best approach to parameter sniffing is to plaster OPTION(RECOMPILE) on everything and keep it from being saved in the cache. Yes I’m bitter.

Proving It

Alright – there’s smoke, but is there fire? This is where it gets hard. Here were my real-world scenarios: a 2014 physical server with 5-10 thousand compilations/sec (!) and a 6 vCore SQL DB with a huge amount of missing CPU in Query Store but only(?) about 50 compiles/sec.

Dead End: Query Store

Query Store actually includes some info about compilation in sys.query_store_query. Of course, this wasn’t even an option for the archaic 2014 instance, but on Azure, I tried checking the most recently executed queries and adding up their compilation impact. Unfortunately, this didn’t work – it was orders of magnitude too low. It might have been partly due to an AUTO vs ALL capture mode, but it was still a bit hacky, and the table clearly wasn’t designed for this.

Also, in the case of my Azure SQL DB, the heavy compilation was regularly knocking Query Store into READ ONLY mode, so when I went to find out the problem, there was no recent history. Oops.

XE Struggles

The answer will have to be found with Extended Events, but this is far from trivial. In 2014, there is a *single* event that can capture all compilation CPU, and it’s a chunky one – I mean, we’re talking morbidly obese, mounds of blubber draped over a WalMart scooter chunky: query_post_compilation_showplan.

It’s even one of a handful that MS stapled a warning to: “significant performance overhead.” And of course there’s no way to exclude the query plan from the output, so with each event you could be dealing with huge XML (especially for those complex queries taking a while to compile).

Oh, and the XE description is wrong as well – cpu_time is in milliseconds, not micro.

derp

Azure Efforts

Still, it was the only option available. I went ahead and tried it for the Azure DB since compilations were lower. I didn’t want to wait weeks for approval to set up a storage account, so I just dumped the output into the ring buffer. Oh look, there’s no way to easily view XE output on an Azure SQL DB…

I didn’t want to leave a high-impact XE running, so I just saved the XML into a file, then used PowerShell to parse and aggregate it. It…worked. Mostly. And was painful. Don’t do it this way. It also turned out that the plans were so large that the ring buffer could only hold a couple seconds worth at a time. The data was still enough to prove a very high compilation impact, and let me send likely culprits to a developer buddy.

XE On-Premise

Given my lukewarm success in Azure, I decided to try a similar approach for my on-premise 2014 server, except I would be able to view live data for the XE, so maybe I could push it out to 10+ seconds for a better sample?

I waited until a low activity time, with “just” a couple thousand compiles a second, fired up the XE, and…

SSMS crashed. I tried an even smaller dataset and other tricks, but SSMS just couldn’t handle the sample size needed. I had to find a new approach.

My nemesis was the enormous query plan…if only I could ignore it. My “Aha!” moment came when I realized I could.

The histogram target bucketizes output by a single attribute, so my cheat was to aggregate on cpu_time for a thirty second sample, then math the total CPU spent compiling. For example, if there were 2 queries that took 10s to compile, and 10000 that took 20ms, this would mean that 2*10+10000*.02 = 220s were spent compiling, or 7.3 cores on average in a 30s sample. Here’s the script I came up with. (Special thanks to Kendra Little for the TSQL to query a histogram)

--CREATE EVENT SESSION [Comp_hist] ON SERVER
--ADD EVENT sqlserver.query_post_compilation_showplan
--ADD TARGET package0.histogram(SET filtering_event_name=N'sqlserver.query_post_compilation_showplan',slots=(1024),source=N'cpu_time',source_type=(0))
--WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
--GO

SET XACT_ABORT ON

IF OBJECT_ID('tempdb..#hist') IS NOT NULL DROP TABLE #hist
--DROP TABLE IF EXISTS #hist

ALTER EVENT SESSION [Comp_hist] ON SERVER  
STATE = start;  
GO  

DECLARE @start DATETIME2(1) = GETDATE()

WAITFOR DELAY '00:00:30'

--thanks kendra!
SELECT
    xed.slot_data.value('(value)[1]', 'int') AS cpu_ms,
    xed.slot_data.value('(@count)[1]', 'int') AS slotcount
INTO #hist
FROM (
    SELECT
        CAST(xet.target_data AS xml)  as target_data
    FROM sys.dm_xe_session_targets AS xet  
    JOIN sys.dm_xe_sessions AS xe  
       ON (xe.address = xet.event_session_address)  
    WHERE xe.name = 'Comp_hist'
        and target_name='histogram'
    ) as t
CROSS APPLY t.target_data.nodes('//HistogramTarget/Slot') AS xed (slot_data);

DECLARE @end DATETIME2(1) = GETDATE()

ALTER EVENT SESSION [Comp_hist] ON SERVER  
STATE = stop;

SELECT SUM(cpu_ms*slotcount)*1.0/DATEDIFF(millisecond,@start,@end) AS [Avg cores compiling]
FROM #hist

The script was an utter success – not only was the overhead minimal, it captured the data I needed to prove that our largest, priciest server was spending 40% of its compute on plan generation.

Happy Ending

My developer contact really ran (sprinted?) with the info from the Azure DB, and fixed the issues behind the biggest compilation offenders (from Entity Framework). Average CPU dropped from 70% to 20%, such that we can scale down and save money. For the on-premise server, the 40% number was enough to convince developers to don their hazmat suits and make changes to the problematic legacy code. I haven’t heard anything back yet, which might be because the legacy code ate them – it’s that bad.

Better Solution?

I’m still not content with my current approach. For one, Azure SQL DB doesn’t allow histogram targets, so it just doesn’t work there. Also, the XE I use is normally high impact, so I’m still wary of using it in a histogram. After some digging, I found a Debug XE (remember, Debug is the channel you have to specifically check in SSMS) that looks promising: sql_statement_post_compile. I’m not sure when it was added, but I see it for Azure and 2019, but not 2014. Next time I’m checking Azure, I’ll test that one out instead.
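
The untested sketch I have queued up looks something like this (for an on-premise 2019 box, since histogram targets aren't an option in Azure SQL DB – and note the source field name is an assumption on my part, so check the event's columns on your build):

CREATE EVENT SESSION [Comp_hist_light] ON SERVER
ADD EVENT sqlserver.sql_statement_post_compile
ADD TARGET package0.histogram(SET filtering_event_name=N'sqlserver.sql_statement_post_compile',slots=(1024),source=N'cpu_time',source_type=(0))
WITH (MAX_DISPATCH_LATENCY=30 SECONDS)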

So in summary, use sql_statement_post_compile if you have access to it, use a histogram if you’re not Azure SQL DB, and don’t let EF generate giant IN lists of hardcoded values that take 300ms to compile. Ick, why did I end with something about EF? Happy thoughts. Happy thoughts! WHY EF WHYYYYYYY????!!!!!!

Giant IN Lists of Doom

I like to make fun of both Entity Framework and junior developers for the terrible SQL they often produce. Sometimes, though, there’s the rampaging Godzilla of an ORM written by a junior developer baked into irreplaceable stage 4 legacy code. Guess where I got my inspiration?

Testing Limits

Today’s antipattern is the enormous IN list (or WHERE clause), and the first and obvious (and entertaining) question is: is there a limit to the number of IN values? Thankfully this is easy to test, and the answer is…

Aww dangit I broke SSMS.

paging @SSMSCrashedAgain

Ok let’s try this a different way – the limit answer is…

Kind of. There’s not a hard coded ceiling, but rather, you can run the parser/compiler out of memory. For me the limit is somewhere above a million values.

In the spirit of Hold My Beer demos, I created a proc to do the testing for me:

CREATE OR ALTER PROC dbo.Hardcoded_Doom(@i INT)
AS
BEGIN

DECLARE @sql NVARCHAR(MAX)
DECLARE @InList NVARCHAR(MAX)

;WITH CTE AS (
	SELECT TOP(@i) 
		1 Num --(ROW_NUMBER() OVER(ORDER BY 1/0)) AS Num
	FROM dbo.Votes a,
	dbo.Votes b
	)

SELECT @InList = STRING_AGG(CONVERT(VARCHAR(MAX),Num),',')
FROM CTE

SET @sql = '
SELECT COUNT(*)
FROM dbo.Votes
WHERE ID IN ('+@InList+')
OPTION(RECOMPILE)
'

EXEC sp_executesql @sql

END
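
Calling it is simple enough – these values are just examples, so adjust @i to taste (somewhere north of a million is where things fell over for me):

EXEC dbo.Hardcoded_Doom @i = 10000
EXEC dbo.Hardcoded_Doom @i = 1000000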

To minimize query size, the IN list is nothing but the value 1 repeated, but there’s something interesting here: SQL Server takes the time to remove duplicates from the IN list. Also, at a certain number of distinct values (65 for me), SQL Server starts separating them into a Constant Scan.

Duplicate removal implies sorting the values, which in turn implies memory consumption.

Literal values aren’t the only way to make IN lists however – I once encountered an EF turd that ran into a system limit by making every single IN list value its own variable. Let’s do that too!

CREATE OR ALTER PROC dbo.Variable_Doom(@i INT)
AS
BEGIN

DECLARE @sql NVARCHAR(MAX)
DECLARE @VarList NVARCHAR(MAX)
DECLARE @InVarList NVARCHAR(MAX)


;WITH CTE AS (
	SELECT TOP(@i) 
		'@'+CONVERT(VARCHAR(10),(ROW_NUMBER() OVER(ORDER BY 1/0))) AS VarNum,
		1 AS Num
	FROM dbo.Votes a,
	dbo.Votes b
	)

SELECT @VarList = STRING_AGG(CONVERT(VARCHAR(MAX),VarNum),' INT = 1,'),
@InVarList = STRING_AGG(CONVERT(VARCHAR(MAX),VarNum),',')
FROM CTE

SET @VarList = 'DECLARE '+@VarList + ' INT = 1'

SET @sql = @VarList + N'
SELECT COUNT(*)
FROM dbo.Votes
WHERE ID IN ('+@InVarList+')
OPTION(RECOMPILE)
'

EXEC sp_executesql @sql

END

Wheeeeeeee!!!

Compilation Time

Good DBAs will remind you that just because you can do something doesn’t mean you should. Ignore them, this is science. Another fun side effect of these bad queries is high compilation cost. A simple query against Votes with a 100 value IN list (using distinct values) took 3ms to compile. 1000 distinct values took 39ms and 10000 values 640ms. Interestingly, 10000 identical IN list values took only 115ms to compile. It looks like the number of distinct values affects compilation time.

Large IN lists make already complicated views even worse to compile. sys.tables is actually a complicated view, so let’s demo with it. The below compiles in under 50ms for me:

SELECT *
FROM sys.tables t
JOIN sys.columns c
	ON c.object_id = t.object_id

But add a 1000 value IN list, and compilation is now over 200ms. I’ve seen cases where a really complicated view had an enormous IN list and compilation took 10 minutes. (And took locks that prevented AG redo the whole time)

What happens when half a dozen threads try to run Hardcoded_Doom at the same time? (Special thanks to Adam Machanic’s SQLQueryStress). Compilation contention!

Isn’t science fun?

WHERE Clause of Even More Doom

Giant OR lists? Yeah, we can do those too. Guess what super special failure mode they have? It shows up when you combine the giant OR list with just one more condition.

Instead of a fast thousand seeks, we now have a Filter operator taking a full minute. And yes, I’ve seen this one in production too.

Conclusion

Giant IN lists are a giant wrecking ball of dumb. Let me recount, these are things I’ve broken or seen broken with large IN lists: memory limit, SSMS, variable limit, compilation limit, AG redo, Query Store, plan cache, and query performance. Please teach your developers to use table types and temp tables instead, and for the love of all the SQL Bobs, don’t let them roll their own ORM.
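
For completeness, here's a minimal sketch of the table type alternative (names are made up; the type has to be created ahead of time, and the app passes the values as a table-valued parameter instead of gluing together a giant IN list):

CREATE TYPE dbo.IdList AS TABLE (ID INT PRIMARY KEY)
GO

DECLARE @Ids dbo.IdList
INSERT @Ids VALUES (1), (2), (3)

SELECT COUNT(*)
FROM dbo.Votes v
JOIN @Ids i
	ON i.ID = v.ID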

An Empirical Look at Key Lookups

I’m going to presume you’re familiar with what key lookups are, and why they exist.

SQL Server’s cost optimizer hates them, because it thinks they’re random reads from a spinny disk – unlikely for a modern enterprise server. What I’m curious about is just how bad these assumptions are in practice.

The Setup

Here’s the million-row test table I’m going to use, along with fake data. I’m going to be testing on my laptop with a standard SSD and an Intel processor with a variable clock speed (it seems to average a bit below 4GHz for these tests).

CREATE TABLE dbo.InternetRecipes (
ID INT IDENTITY PRIMARY KEY,
Calories INT,
SizeOfIntro INT,
RecipeName NVARCHAR(300)
)

INSERT dbo.InternetRecipes
SELECT TOP (1000000)
(ROW_NUMBER() OVER(ORDER BY 1/0))%1000,
(ROW_NUMBER() OVER(ORDER BY 1/0))%10000+500,
'Paleo Fat Free Cotton Candy'
FROM master..spt_values a,
master..spt_values b


--With includes and not, for testing
CREATE INDEX IX_InternetRecipes_Calories
ON dbo.InternetRecipes(Calories)

CREATE INDEX IX_InternetRecipes_Calories_incl
ON dbo.InternetRecipes(Calories)
INCLUDE (RecipeName)

Time/Cost Comparison

Using an index hint, we can see a direct comparison between a plan with key lookups, and without. The “without” plan makes use of the index that includes RecipeName. Normally this is described as a covering index, but today I’m designating it the “inclusive” index.

--7.7 query bucks
SELECT RecipeName
FROM dbo.InternetRecipes WITH(INDEX(IX_InternetRecipes_Calories_incl))
WHERE Calories >= 0
AND RecipeName = 'Vegan Steak'

--200 query bucks
SELECT RecipeName
FROM dbo.InternetRecipes WITH(INDEX(IX_InternetRecipes_Calories))
WHERE Calories >= 0
AND RecipeName = 'Vegan Steak'
OPTION(MAXDOP 1)

Some interesting details emerge here. First: key lookup cost does not scale linearly.

Working the math backwards shows the cost comes from a million CPU costs, but only 11,495 IO costs. Incidentally, 11,495 is the number of data pages for the table. Key Lookup IO costs are capped at the number of pages in the table, although you’re unlikely to encounter this naturally. Of course, the next thing for me to test was everything in-between, and the experiments showed that the key lookup cost does not assume each will cause an IO.
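
(If you want to verify the page count on your own copy, a quick sketch against the standard DMV does the trick.)

SELECT index_id, in_row_data_page_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.InternetRecipes')
AND index_id = 1 --the clustered index, i.e. the table's data pages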

The exact formula still eludes me though. I think it’s a delightfully nerdy problem that nobody will care about, so I’ll be working on it as soon as possible.

The one million key lookup query is costed about 26X more than the simple seek, and takes about 18X longer to run. The optimizer does a decent job here! Or does it? I was running the test with a warm cache, so what if we ignore IO costs?

--MS will laugh at you if this breaks your server
DBCC SETIOWEIGHT(0)

The optimizer cost ratio balloons to 149X (while actual runtime remains 18X of course). Even though the actual executions involved no IO, the optimizer does a better job of relative costing when it includes IO costs.

Let’s fix that IO costing thing…

DBCC SETIOWEIGHT(1)

And test with a cold cache…

CHECKPOINT --if we just created the table/indexes
DBCC DROPCLEANBUFFERS  --repeat as needed

The execution time ratio is about 15X with IO – not much of a change. If I disable prefetch with 9115 (thanks Paul), the time ratio moves to 17X, and query buck costs stay unchanged. So, the optimizer does a decent job costing between the inclusive index and the key lookup plan, but only when it assumes IO costs.

Tipping Point Testing

Because SQL Server loathes key lookups so much, it will sometimes choose to scan the entire table just to avoid them. In my experience, it does this way too early. Can I prove it?

Let’s remove the inclusive index, and select results into a trash variable for testing.

DROP INDEX IX_InternetRecipes_Calories_incl
ON dbo.InternetRecipes

SELECT RecipeName
FROM dbo.InternetRecipes
WHERE Calories >= 996
OPTION(MAXDOP 1);

I find that the tipping point occurs between Calories values of 996 and 997, which is between 3000 and 4000 rows/lookups.

The full scan plan takes on average 135ms, while the key lookup plan for 4000 rows is only 9ms.

It’s not until Calories value 925, or 75k rows, that the true tipping point occurs, and the key lookup plan takes as long as the scan. Even testing with a cold cache, the tipping point is at 41k rows. So for my laptop with a warm cache, the optimizer switches to a scan at just 4% of the optimum number of rows, or 25X too soon (roughly).

Throughput Testing

What else can we learn about key lookup impact? Let’s cram it in a loop and set spin cycle to ultra high.

First, bring back the inclusive index.

CREATE INDEX IX_InternetRecipes_Calories_incl
ON dbo.InternetRecipes(Calories)
INCLUDE (RecipeName)

Now, the general loop structure:

DECLARE @i INT = 0
DECLARE @trash NVARCHAR(300)
DECLARE @dt DATETIME2 = SYSDATETIME()

WHILE @i < 10000
BEGIN

	SELECT TOP(1000) @trash = RecipeName
	FROM dbo.InternetRecipes WITH(INDEX(IX_InternetRecipes_Calories))
	WHERE Calories = 666 --Devil's food cake
	OPTION(MAXDOP 1);

	SET @i += 1

END

SELECT DATEDIFF(MILLISECOND,@dt,SYSDATETIME())

Using TOP to limit the number of rows/lookups, I measure how long it takes to run a loop of 10k queries. Here are my averaged numbers:

Rows         Key Lookup Time (ms)   Inclusive Seek Time (ms)
1000         18,223                 1,747
100          1,857                  331
10           391                    200
1            250                    181
1k, random   21,284                 1,824

These numbers are specific to my laptop and particular test. Still, what I see is that the impact of key lookups becomes larger the more there are. This makes sense, as there is some overhead per query.

If I assume all extra time in the 1000 row test comes from the lookups, I get a value of about 1.6 microseconds per lookup. To put that in perspective, to occupy an extra core of CPU would take 600,000 lookups/second.
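
Showing my work, with the numbers lifted straight from the table above:

--10,000 queries x 1,000 lookups each = 10,000,000 lookups
--extra time = 18,223ms - 1,747ms = 16,476ms
SELECT (18223 - 1747) * 1000.0 / 10000000 AS microseconds_per_lookup,	--roughly 1.6
	10000000 / ((18223 - 1747) / 1000.0) AS lookups_per_second_per_core	--roughly 600k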

I was curious if using a repeated value took unfair advantage of the CPU cache, so I also tested with randomized “Calories” values, showing minor slowdown. An extra core under these conditions (probably more representative of a production server) would need 514k lookups/s. And a big caveat: all these tests use a single table design, and are run on a personal laptop.

Summary

Lookups hurt performance, but probably not as much as you think, and definitely not as much as the optimizer thinks when considering the seek/scan tipping point. If you see a scary cost percentage number by a key lookup, remember that it’s probably junk.

Avoiding Sorts with the $PARTITION Function

Backstory

I imagine everybody has them – those annoying, bloated, misshapen tables that don’t really belong in the database. I mean, this is an expensive enterprise RDBMS supporting a high-volume OLTP environment, and a depressingly large portion of the storage (also expensive, btw) is dedicated to decades’ worth of nvarchar(max) email attachments.

I know how hard it can be to undo someone else’s decision to treat SQL Server like a landfill, as this usually involves getting others to do something, i.e. politics. But sometimes we DBAs can make a helpful change to these archive tables on our own. And that’s where I found myself, with billions of rows in a neglected heap that couldn’t be moved, but could be turned into a clustered columnstore table for 5X compression.

Setup Script

Here’s the setup script in case you want to follow along for the repro. No billions of rows here, just 10 million.

CREATE PARTITION FUNCTION pf_years( DATETIME )
    AS RANGE RIGHT FOR VALUES ('2010','2011','2012','2013','2014','2015',
	'2016','2017','2018','2019','2020');

CREATE PARTITION SCHEME ps_years
AS PARTITION pf_years
ALL TO ( [PRIMARY] );

--Here's a table we're loading
CREATE TABLE BunchaDates
(somedate datetime,
junk char(10)
) ON ps_years(somedate)

CREATE CLUSTERED COLUMNSTORE INDEX CCX_BunchaDates ON dbo.BunchaDates ON ps_years(somedate)

--Original table, full of junk
CREATE TABLE BunchaDatesSource (
somedate datetime,
junk char(10)
)

INSERT BunchaDatesSource
SELECT TOP (10000000)
DATEADD(MINUTE,24*60*365*6*RAND(CHECKSUM(NEWID())),'1/1/2014')
,'demodemo'
FROM master..spt_values a
,master..spt_values b
,master..spt_values c

The Problem

My plan was to load the data from the heap into the columnstore table, later doing a drop and rename, but I wanted to transfer a limited amount at a time. My environment has an Availability Group set up, so my constraint was minimizing log generation. (If you’re looking for a technique to load a columnstore table as quickly as possible, check out Joe Obbish.) But, when I started loading a limited time range of data, I quickly ran into a problem.

INSERT BunchaDates
SELECT somedate,junk
FROM BunchaDatesSource bs
WHERE YEAR(bs.somedate) = '2014'

A Sort operator shows up! To explain my distaste, the columnstore table doesn’t need it – it is inherently unordered (at least in the way we think about it). Moreover, I am selecting rows guaranteed to reside within a single partition. Yet SQL Server insists on ordering rows, because it very helpfully wants us to avoid the performance penalty of switching between partitions insert-by-insert. Instead I get a slow query, ginormous memory grant, and tempdb spills. Yay!

Ok. You might notice that I used YEAR() in my predicate. What if I use a proper date range instead?

Crud. Ok. Was it the boundary dates? Maybe a narrow date range entirely within the year will work?

Fine. What about an exact time? Pretty please, oh kind and generous Query Optimizer?

The Solution

Looping through all possible times of a datetime field in order to avoid a sort is a very bad way to do things. But thankfully, there’s a better way: $PARTITION.

The $PARTITION function uses your partition function to return which partition a value belongs to, and importantly, it can be used to provide a guarantee to SQL Server that it doesn’t have to insert into multiple partitions, and will therefore prevent a superfluous sort.
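
As a quick sanity check (just an illustrative aside – the partition number you get back depends on your boundary values), you can ask the function directly where a value lands:

SELECT $PARTITION.pf_years('2014-06-15') AS partition_number

With that established, the fix is just to filter on the function itself: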

INSERT BunchaDates
SELECT somedate,junk
FROM BunchaDatesSource bs
WHERE $PARTITION.pf_years(somedate) = $PARTITION.pf_years('2014')

No Sort! But there’s still room for improvement:

You see, $PARTITION doesn’t qualify for predicate pushdown, and on a query involving billions of rows, that adds up.

But guess what, we can still add the date range, and that does get predicate pushdown.

INSERT BunchaDates
SELECT somedate,junk
FROM BunchaDatesSource bs
WHERE $PARTITION.pf_years(somedate) = $PARTITION.pf_years('2014')
AND bs.somedate >= '2014/1/1' AND bs.somedate < '2015/1/1'

Conclusion

Sometimes, the optimizer misses something, and you have to babysit it. In my case, using $PARTITION allowed me to finish the table load in my time window, eventually freeing up significant space to be used for other questionable data.

The takeaway? $PARTITION lets you guarantee for the optimizer that only a single partition is involved, but you might want to retain your other predicates for predicate pushdown or perhaps index qualification.

Now excuse me while I compose yet another carefully worded email about that email attachment table…

What’s the Difference between CXPACKET and CXCONSUMER

If wait stats were celebrities, then CXPACKET would be a superstar, showing up in a gauche limo and getting swarmed by nerdy fans asking for an autograph. It would also suffer from multiple personalities, because what it represents, parallelism, is complex with many moving parts.

So Microsoft decided to split this wait into two: CXPACKET and CXCONSUMER. CXPACKET is the evil twin with a little goatee, and CXCONSUMER is your friendly neighborhood benign wait (supposedly). Parallelism remained complicated, and MS’s explanation remained simple – annoyingly so. I had to do a lot of reading and testing to develop some intuition – here’s my attempt to share.

The starting point is that exchange operators have one or more consumer threads that receive packets of data from one or more producer threads. (You can read more in Paul White’s article here.) A DOP 8 Repartition Streams will actually coordinate 16 workers – 8 on the consumer side and 8 on the producer.

In general, CXCONSUMER waits represent consumer threads waiting, and CXPACKET the producer side. I visualize it like this:

This is, of course, a drastic simplification. For one, the above only represents a single worker on each side. Reality will be more complex. Also, CXPACKET waits aren’t limited to the producer side – it’s common to see them on the consumer side as well, especially in Gather Streams.

Maybe it’s a little easier now to understand why Microsoft guidance is to ignore CXCONSUMER waits as harmless. Normally, the heavy lifting of a query is done on the producer side – of course the consumer threads will be waiting for data, and they will normally finish their own work before the next packet from the producer side is ready. But certain adverse situations (blocking, skew) still explode CXCONSUMER waits, so the guidance is merely a guideline.

Just because I say something doesn’t make it so, whether or not I claim authority. So here, have a simple demo:

I start with a 10 million row table, using a script modified from the wonderful dbfiddle.uk

DROP TABLE IF EXISTS dbo.A
CREATE TABLE dbo.A (ID int primary key)

;with
  p0(i) as (select 1 union all select 1 union all select 1 union all select 1)
, p1(i) as (select 1 from p0 as a, p0 as b, p0 as c, p0 as d, p0 as e)--1K rows
, p2(i) as (select 1 from p1 as a, p1 as b, p1 as c)

INSERT dbo.A
SELECT TOP(10000000) ROW_NUMBER() OVER(ORDER BY 1/0)
FROM p2

And here’s a wrapper you can use to grab waits for the query

DROP TABLE IF EXISTS #waits

SELECT *
INTO #waits
FROM sys.dm_exec_session_wait_stats
WHERE session_id = @@SPID

--Query goes here

SELECT sws.wait_type, sws.waiting_tasks_count - ISNULL(w.waiting_tasks_count,0) AS wait_count,
sws.wait_time_ms - ISNULL(w.wait_time_ms,0) AS wait_ms
FROM sys.dm_exec_session_wait_stats sws
LEFT JOIN #waits w ON w.wait_type = sws.wait_type
WHERE sws.session_id = @@SPID
AND sws.waiting_tasks_count > ISNULL(w.waiting_tasks_count,0)

Ok, now let’s create a query that’s deliberately slow on the consumer side using expensive nested HASHBYTES calls

DECLARE @trash1 VARBINARY(64)

SELECT @trash1 = HASHBYTES('SHA2_256',HASHBYTES('SHA2_256',HASHBYTES('SHA2_256',HASHBYTES('SHA2_256',
			CONVERT(VARCHAR(10),A.ID)
			))))
FROM dbo.A
WHERE A.ID%10 < 100

Here’s the plan:

High CXPACKET query
Ouch!

Slow down the producer side with the below query (and remember that actual execution plans don’t include CXCONSUMER waits, so you’ll have to use the wrapper)

DECLARE @trash2 INT

SELECT @trash2 = A.ID
FROM dbo.A
WHERE 0x800000 < HASHBYTES('SHA2_256',HASHBYTES('SHA2_256',HASHBYTES('SHA2_256',HASHBYTES('SHA2_256',
			CONVERT(VARCHAR(10),A.ID)))))

And the plan for that…

High CXCONSUMER query

Which has high CXCONSUMER waits, but not as high as the CXPACKETs were, because there’s only one consumer thread for the Gather Streams exchange.

So there you go: an animation and demo for the superstar twins of wait stats. Excuse me while I try to get another autograph demo…

HT Waits – Explained and Animated

Remember when you started learning wait stats, and were told that CXPACKET just means parallelism, nothing to see here, move on? HTBUILD is the new CXPACKET. Microsoft documentation on the HT waits is a little, uh, thin. Have a problem? Increase the cost threshold for parallelism (which you still can’t do in Azure SQL DB by the way – it’s stuck at 5).

If you’re anything like me, the CXPACKET explanation wasn’t sufficient, and the current situation with HT waits certainly isn’t either. Googling for them didn’t turn up much, so I put on my safety goggles and fired up extended events. And lots of demo queries. And the debugger. I didn’t even bluescreen my system this time, probably due to the goggles.

There are three main HT waits that I focused on: HTREPARTITION, HTBUILD, and HTDELETE. To get these, you need a query plan where three things combine: a (1) parallel (2) batch mode (3) hash operation.
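
For the record, here's a sketch of the general query shape that can surface them – it assumes a big table like the 10 million row dbo.A from the CXPACKET demo above, and uses a columnstore index as one easy way to get batch mode (2019 can also use batch mode on rowstore):

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCX_A ON dbo.A (ID)
GO

SELECT COUNT_BIG(*)
FROM dbo.A a1
JOIN dbo.A a2
	ON a1.ID = a2.ID
GROUP BY a1.ID % 100
OPTION(HASH JOIN, MAXDOP 4)	--force the hash and parallel ingredients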

I’m working on an amateur explanation of the batch mode hash join, which I’ll later link to. The short version is this: it starts with a Repartition phase, which is where it consumes batches of rows. I assume there’s some repartitioning going on, but I don’t know what that actually means here. There’s a synchronization gate at the end of this phase, and any threads that finish their work before the others will wait with HTREPARTITION.

After this is the Build phase, which seems to actually build the hash table. Like earlier, there’s a gate at the end, and threads that finish first wait with HTBUILD.

Afterwards is the Probe phase. This is where the hash join finds matching rows and outputs them. There are a couple of non-HT waits here, and skew normally shows up with our old friend CXPACKET.

Next comes the Cleanup phase…except one of the weird things about batch mode hash operations is that they can share memory. So “next” might not be next, but after whatever hash operations are chained together. Cool huh? Same deal here – threads that are done wait with HTDELETE.

Blah blah blah, who learns by reading? LOOK AT THE PRETTY PICTURE:

Some notes: all of the hash aggregates I tested lacked a Repartition phase. This means that HTREPARTITION only shows up for joins. I’m not certain of this however, and would love a repro of a hash agg with this wait. Also, a thread can repeat an HT wait within a phase. There seem to be some extra synchronization activities at the end, and workers blink on and off with these before moving on.

I admit it; ultimately, HT* waits do look a lot like CXPACKET. They show up with parallelism and skew, and there’s not much you can do to easily affect them (ignoring silly MAXDOP 1 suggestions). I learned enough about these that I think I could track down problems if necessary, but so far it hasn’t been necessary. Thankfully I’m a nerd, and knowledge is its own reward. And knowledge with a gif is even better.

MAXDOP is a Lie

Let’s start with STATISTICS TIME output from a query:

How many cores is it using? 124ms of CPU over 42ms of elapsed time = an average of 2.95 cores.

Now what if I told you this was a MAXDOP 2 query?

It’s true, this was a MAXDOP 2 query, and it was definitely using more than 2 cores at once (though admittedly STATISTICS TIME can be imprecise). A friend of mine noticed this situation happening, and Paul White provided the explanation: it’s only the parallel branches that are limited to MAXDOP. The coordinator worker (the serial section at the far left of the plan) runs at the same time. Usually it doesn’t have much to do, and the parallel zones don’t run full speed, so you don’t see more than MAXDOP CPU usage.

But now that we know this is possible, how would you design a query to deliberately use more than MAXDOP? My own solution was to find a plan with minimal blocking, and force computation on the serial thread with TOP and some column weirdness.

And here’s the monstrosity I used (StackOverflow2010):

;WITH CTE AS (
SELECT TOP 100000 'a' Val
FROM dbo.Posts p1
JOIN dbo.Posts p2
	ON p1.Id = p2.Id
	)

SELECT 
COUNT(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(CHECKSUM(
CASE WHEN CONVERT(nchar(1),Val) = N'a' THEN 1 END
)+1)+1)+1)+1)+1)+1)+1)+1)+1+1.1))
FROM CTE
OPTION(
QUERYTRACEON 8649, MERGE JOIN,
MAXDOP 2)

It’s…not elegant, so I’m curious what examples others can find. But hey, this is a query that uses more than MAXDOP!

Edit: it seems that I accidentally trolled Pedro Lopes with the title. No, MAXDOP is not a lie, technically it’s in the documentation (and here, links courtesy of Pedro). Sorry Pedro! However, if you really want to learn how things work, I recommend reading Paul White’s article instead, which happens to be *both* accurate and readable.