The Top/Max Transformation

Behold, a simple MAX query:
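For concreteness (borrowing the PartitionedNumbers table from the repro script at the bottom), the query and the plan shape SQL Server silently turns it into look roughly like this:

```sql
-- A MAX on the leading key of the index...
SELECT MAX(Number1)
FROM PartitionedNumbers;

-- ...runs as if written like this: a TOP(1) backward ordered scan
-- (my paraphrase of the plan, not an official rewrite)
SELECT TOP(1) Number1
FROM PartitionedNumbers
ORDER BY Number1 DESC;
```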

But wait, where did that Top operator come from? Oh, SQL Server, you clever rascal…

Simple and extremely effective – a solid query transformation. But can you handle more complex versions, SQL Server?

Let’s try the bane of ordered scans: partitioned tables.

Please note that all repro scripts are shared at the bottom.

Partitioning

The transformation works if the column is both the leading index key and the partitioning key.

But fails if the leading key is not the partitioning key.

I suppose that makes sense: SQL Server would have to grab the top value from each partition and then sort those, a much more complicated transformation.
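If you're stuck in that situation, the complicated transformation can be written out by hand. Here's a rough sketch, assuming the table was recreated partitioned on Number2 (per the repro comment) and hardcoding the ten partitions of pf_hundreds:

```sql
-- Per partition, grab the max of the leading key (each partition is
-- internally ordered by it), then take the max of those ten values
SELECT MAX(x.part_max) AS overall_max
FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) AS parts(p)
CROSS APPLY
(
    SELECT MAX(Number1) AS part_max
    FROM PartitionedNumbers
    WHERE $PARTITION.pf_hundreds(Number2) = parts.p
) x;
```

Whether the optimizer turns each inner MAX into a TOP depends on the usual costing whims, so check the plan.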

Second Column

Can SQL Server use the Top/Max transformation when looking at the second column of an index? In theory it should work.

Time for our favorite DBA sing-along song, It Depends.

Here’s the table, clustered on c1, c2, loaded with some filler data as seen in the repro script at the end.

CREATE TABLE dbo.A(
c1 INT NOT NULL, 
c2 INT NOT NULL,
INDEX CX_somecreativename CLUSTERED (c1,c2)
)

The simplest form works

SELECT MAX(c2)
FROM dbo.A
WHERE c1 = 3

What if we make it more complicated? Let’s add a basic temp table.

CREATE TABLE #3 (three INT NOT NULL)
INSERT #3 VALUES (3)

This works…

SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) 3 FROM #3)

But this doesn’t

SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) three FROM #3)

Spot the difference? Pulling a column value from the second table breaks the transformation, while pulling a constant keeps it. Weird. Anyways, let's try an (admittedly inexact) rewrite of the query.

SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
	SELECT MAX(c2) AS the_max
	FROM dbo.A
	WHERE A.c1 = s.c3
	) x

Victory again! Wait, what’s that? A change to the table definition making the second column nullable?

ALTER TABLE dbo.A
ALTER COLUMN c2 INT NULL

Ok fine, I’m sure the last query won’t…oh.

So the TOP/MAX transformation is finicky on a second column, depending on nullability [is that a word?] and how the query is written.

Going Backwards

There’s even a situation where SQL Server should (should!) be able to do a reverse transformation, a TOP into a MAX. Columnstore indexes store maximum values per segment, so a TOP(1) query could simply grab that!
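Those per-segment values are visible in the catalog. A sketch, where dbo.CS stands in for any table with a columnstore index (for integer columns, min_data_id and max_data_id encode the segment's min and max):

```sql
SELECT p.partition_number, s.column_id, s.segment_id,
       s.row_count, s.min_data_id, s.max_data_id
FROM sys.partitions p
JOIN sys.column_store_segments s
    ON s.partition_id = p.partition_id
WHERE p.object_id = OBJECT_ID('dbo.CS'); -- hypothetical columnstore table
```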

Conclusion

I can’t tell you how tempted I was to call this MAXTOP. Thinking about it still causes stray giggles to bubble up, as I gleefully envision sharing this info with eager young DBAs at a SQL Saturday, then ushering them off to a session on parallelism. Thankfully I’m not that evil.

The real conclusion is that SQL Server is programmed to be very clever. Despite the cleverness though, details matter, and the cleverness often falls short. What’s that warm glow I feel inside? Ah yes, job security.

Repro scripts

USE TestDB
GO

CREATE PARTITION FUNCTION pf_hundreds(int)
    AS RANGE RIGHT FOR VALUES (100,200,300,400,500,600,700,800,900);

CREATE PARTITION SCHEME ps_hundreds
AS PARTITION pf_hundreds
ALL TO ( [PRIMARY] );
GO

DROP TABLE IF EXISTS PartitionedNumbers
GO

CREATE TABLE PartitionedNumbers (
Number1 INT NOT null, 
Number2 INT NOT NULL, 
CONSTRAINT PK_numnum PRIMARY KEY CLUSTERED(Number1,Number2))
ON ps_hundreds(Number1) --drop/create table partitioned on Number2 for second query

INSERT PartitionedNumbers
SELECT TOP(999) ROW_NUMBER() OVER(ORDER BY 1/0), ROW_NUMBER() OVER(ORDER BY 1/0)
FROM master.dbo.spt_values

SELECT MAX(Number1)
FROM PartitionedNumbers
--Second repro script

USE TestDB
GO

DROP TABLE IF EXISTS dbo.A
DROP TABLE IF EXISTS #3
go

CREATE TABLE dbo.A(
c1 INT NOT NULL, 
c2 INT NOT NULL,
INDEX CX_somecreativename CLUSTERED (c1,c2)
)

INSERT dbo.A
SELECT TOP(100000) 
(ROW_NUMBER() OVER(ORDER BY 1/0))%100,
(ROW_NUMBER() OVER(ORDER BY 1/0))%500 + 250
FROM sys.columns a, sys.columns b, sys.columns c

CREATE TABLE #3 (three INT NOT NULL)
INSERT #3 VALUES (3)

--works
SELECT MAX(c2)
FROM dbo.A
WHERE c1 = 3

--works
SELECT MAX(c2)
FROM dbo.A 
WHERE c1 = (SELECT 3)

--works
SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) 3 FROM #3)

--does not work
SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) three FROM #3)

--works
SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
	SELECT MAX(c2) AS the_max
	FROM dbo.A
	WHERE A.c1 = s.c3
	) x


--change things around
ALTER TABLE dbo.A
ALTER COLUMN c2 INT NULL

--does not work
SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
	SELECT MAX(c2) AS the_max
	FROM dbo.A
	WHERE A.c1 = s.c3
	) x

No Running at the Resource Pool

Ok, I get it, scheduling queries can be complicated. See this and this and maybe this and this too but only if you have insomnia. I still thought I kinda understood it. Then I started seeing hundreds of query timeouts on a quiet server, where the queries were waiting on…what? CPU? The server’s at 20%, why would a query time out after a minute, while only getting 5s of CPU, and not being blocked?

It was Resource Governor messing up one of my servers recently in a way I never expected. Apparently RG will try to balance *total* CPU usage for pools, to the point that a single query sharing a scheduler will be completely starved.

Uh, this might be easier to explain with a picture:

And a demo of course:

Let's set up some Resource Pools with no constraints. (Thanks to this blog for making some easy-to-copy scripts.) Again, these pools don't limit anything – the only thing they do is exist. Behold, pools Apples and Bananas:

USE master
GO
 
CREATE RESOURCE POOL [Apples] WITH(
		min_cpu_percent=0, 
		max_cpu_percent=100, 
		min_memory_percent=0, 
		max_memory_percent=100, 
		AFFINITY SCHEDULER = AUTO
)
GO
 
CREATE RESOURCE POOL [Bananas] WITH(
		min_cpu_percent=0, 
		max_cpu_percent=100, 
		min_memory_percent=0, 
		max_memory_percent=100, 
		AFFINITY SCHEDULER = AUTO
)
GO

CREATE WORKLOAD GROUP [Apple] 
USING [Apples]
GO
CREATE WORKLOAD GROUP [Banana] 
USING [Bananas]
GO

CREATE FUNCTION dbo.Fruit_Inspector() 
RETURNS SYSNAME WITH SCHEMABINDING
AS
BEGIN
  DECLARE @workload_group sysname;
 
  IF (program_name() LIKE 'A%')
      SET @workload_group = 'Apple';
  IF (program_name() LIKE 'B%')
      SET @workload_group = 'Banana';
     
  RETURN @workload_group;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.Fruit_Inspector);
GO

ALTER RESOURCE GOVERNOR RECONFIGURE
GO

Now for a little bit of SSMS trickery: you can use the connection properties to set an application name of your choice, which the classifier function then uses to assign your connection to a resource pool.

Check which scheduler you’re on (and your session id, workload group, and app name) with this query:

SELECT er.scheduler_id, wg.name AS wg_name, er.session_id, es.program_name
FROM sys.dm_exec_requests er
JOIN sys.resource_governor_workload_groups wg
ON wg.group_id = er.group_id
JOIN sys.dm_exec_sessions es
ON er.session_id = es.session_id
WHERE er.session_id = @@SPID

Get a session ready for Resource Pool Apple – it doesn’t matter what scheduler it’s on. Place an infinite-loop CPU-sucking query in it, ready to go. I don’t start it until later.

DECLARE @trash bigint

WHILE 1=1 
BEGIN
	SELECT @trash = COUNT(*) FROM dbo.Votes
end

Now get sessions for Pools Apple and Banana that happen to land on an identical scheduler (I did this by opening a bunch of Apple sessions, then reconnecting the Banana session until I got a match).

Start the other infinite-loop query, then run a high-CPU query on each of the colliding sessions – here's what I used:

SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
OPTION(MAXDOP 1,USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140'))

Remember we have two queries running from Pool Apple, and one from Pool Banana, where the Banana query is on the same scheduler as an Apple query. Here’s the Banana runtime – 15s of pure CPU:

But here’s the Apple runtime:

Apple took 30s because it wasn't allowed *any* CPU time while Banana was running. Because SQLOS was trying to balance the resource pools while the infinite query was running, the Apple query waited on SOS_SCHEDULER_YIELD for the entire Banana query, which I do think is absolutely bananas.

Remember, the Banana CPU time was 15,180ms.

You’ll even see super-long scheduler yield waits with this pattern.

If the totals look off, that’s because I tested multiple times in these sessions

Yes, seriously, waiting 5 seconds to get time on the CPU.
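A cheap way to see the damage accumulating is the server-wide wait stats (cumulative since startup or the last clear):

```sql
-- High wait time here means tasks sat runnable, starved of CPU
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'SOS_SCHEDULER_YIELD';
```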

Please note this doesn’t even need to be a heavy query from Pool A:

“Working as intended” – Top Support Man

In summary, have the doodle again:

In other summary, Resource Governor is scary, I blame it for hundreds of avoidable timeouts, and I hope MS improves their documentation.

P.S. If you followed my demo, that infinite loop is still running – might wanna kill it. Even worse, RG is still around, which you can fix with the below:

ALTER RESOURCE GOVERNOR DISABLE

Zombie Transactions Ate my Database

A failover that needs five minutes for the secondary database to come online is a problem, especially if it generates stack dumps. Thankfully, Microsoft has Top Men, and with their help digging into source code, we uncovered…well, something neither of us had ever seen before. I’m naming them Zombie Transactions, because then I can pretend my job is more glamorous than clicking at a screen all day.

Why the Long Failover?

Every time there's a failover, SQL Server must run the recovery process. The standard Analysis, Redo, and Undo phases occur, and sometimes even an extra step. In our case, it was Redo taking a ridiculously long time. According to Extended Events, SQL Server was searching through the log for 25,000 active transactions on every recovery (and incidentally producing stack dumps because this took so long).

Except these transactions weren't really live – they were dead, shambling about our system, eating our performance. They didn't come from normal activity, they didn't show up in DMVs, and they couldn't be killed with a failover.

I eventually found where they were lurking by using fn_dblog – they were in our checkpoints. Every checkpoint needs to include active transactions – it’s part of the design that allows a checkpoint to occur without freezing your whole database. And for some reason, our database was including this horde of undead transactions in each checkpoint.

Unresolved Non-DTC Cross Database…What?

This is where Microsoft support was helpful (you know, Top support, not the level 1s you have to put up with for the first month of an issue). The checkpoint entries were identified as “unresolved non-DTC cross-database transactions.” This awkward tongue-twister is the other reason I’m calling them zombies.

In other words, these are leftovers from transactions spanning multiple databases with an AG (for example, a trigger that logs changes to a separate non-AG database). If the transaction doesn’t complete correctly, it can leave an unresolved zombie behind. A minimal repro succeeded at generating a couple with a mid-transaction failover, though we never figured out how exactly 25k came about.

Zombie Hunting

I got tired of checking manually, so I wrote a script to search for zombies. It uses fn_dblog, which is undocumented and comes with caveats. I recommend running it on a secondary.

USE [MyDB]
GO
CHECKPOINT
GO
DECLARE @base BIGINT
DECLARE @desc TABLE (descr VARCHAR(500))

SELECT TOP(1) @base = CONVERT(BIGINT, CONVERT(VARBINARY(6), '0x' + REPLACE([Transaction ID], ':', ''), 1))
FROM sys.fn_dblog(NULL, NULL)
WHERE [Transaction ID] <> '0000:00000000'

INSERT @desc
SELECT [Description]
FROM sys.fn_dblog(NULL, NULL)
WHERE Operation = 'LOP_XACT_CKPT'

SELECT COUNT(CASE WHEN xacts.xval < @base THEN 1 END) AS zombies, COUNT(*) AS total
FROM (
    SELECT CONVERT(BIGINT, CONVERT(VARBINARY(6), '0x' + x.value, 1)) AS xval
    FROM (
        SELECT REPLACE(REPLACE(descr, 'XdesIDs: ', ''), ':', '') AS xdes
        FROM @desc
    ) l
    CROSS APPLY STRING_SPLIT(l.xdes, ',') x
) xacts
WHERE xval <> 0

How to Fix [Everything with Computers]

There was some back and forth with Microsoft on how to remove these – I didn’t want to edit memory, and restarting the server didn’t help. However, I quickly discovered that setting the database offline then online fixed it. That’s right, turning it off then on again was a solution.

After some more experimenting, we figured out that setting the database to Read-Only then back would fix it, and was faster than offline/online. A teammate also discovered that reseeding is necessary, since the zombies would reappear if you fail over. He built an awesome script for doing this, and I’m sure he would be willing to help for entirely reasonable rates™. Our failover time went from five minutes to ten seconds after we cleared the zombies.
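The read-only bounce is roughly the following (a sketch only – a database in an AG has to be removed from it first, and both statements briefly kick everyone out):

```sql
ALTER DATABASE [MyDB] SET READ_ONLY WITH ROLLBACK IMMEDIATE;
ALTER DATABASE [MyDB] SET READ_WRITE WITH ROLLBACK IMMEDIATE;
```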

Summary

You might have zombie transactions if you've been running an AG without the DTC setting. If you do, and there are a lot of them, they can slow down failovers. To kill them off, you'll have to set the database to read-only and back, or bring it offline and online. Yes, this requires downtime.

Please let me know if you find them in your system – seriously, let me know, and let Microsoft know. That way there's a chance we get a fix that doesn't require downtime.

Fixing Queues with Watermarks

Queues are pretty common, but it’s not too often I encounter them in SQL Server. There’s a reason: SQL Server is bad at them. I’m not just talking about Service Borker – even native table implementations crumble under heavy load.

The standard pattern looks something like below – allowing multiple threads to pull a work item from the single queue table.

WITH CTE AS 
    (SELECT TOP (1) *
    FROM dbo.testq WITH (ROWLOCK, READPAST) --locking hints allow concurrency
    ORDER BY ID)
DELETE CTE
OUTPUT Deleted.filler
INTO @d --table variable to store the stuff to process

I recently had the pleasure of investigating a procedure that pulled from a queue. Normally it was fast, but occasionally runtime would spike. The spooky thing was the query was using an ordered index scan that should only read one row, but during the spikes it was reading thousands.

Surely there’s a rational explanation…

…what to blame when your queue table has a bad hair day

Seriously. In the standard implementation of a queue table, dequeued records are deleted. But SQL Server typically doesn’t remove deleted rows immediately. Instead it marks them as ghost records (aka soft deletes), and exorcises them later, which is why a TOP(1) can easily read a lot more than one row.

So when the dequeue finds the next record to work on, it scans from the start of the index through ghosted rows to reach it. This isn’t a problem if there are only a couple, but sometimes servers can be so busy that Ghost Record Cleanup can’t keep up, or the ghost records are needed for versioning (which can easily happen if you use AGs). In those cases, the number of ghosts grows and grows, and you end up haunted by bad performance.
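You can watch the haunting happen. The physical-stats DMF exposes ghost counts (DETAILED mode is required for these columns, and it isn't free on large tables):

```sql
SELECT index_id, alloc_unit_type_desc, record_count,
       ghost_record_count, version_ghost_record_count
FROM sys.dm_db_index_physical_stats(
    DB_ID(), OBJECT_ID('dbo.testq'), NULL, NULL, 'DETAILED');
```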

It was an intriguing problem, and I worked like a man possessed. I tested some bizarre approaches to shift rows around or force cleanup without blocking parallel dequeues, but what ended up succeeding was a watermark pattern. In a separate table, I store the minimum identity processed. Then, every dequeue can range scan starting from the watermark.

In the original implementation, doubling the number of rows to dequeue would quadruple the runtime. 30k rows took 2:50, while 60k rows took 10:01. For the watermark pattern however, 30k rows dequeued in :23, and 60k in :48. A clear win!

Here’s roughly what my implementation looked like. If you want to test for yourself, you can force ghost records (specifically version ghosts) by leaving a select transaction open with snapshot isolation enabled. I then spammed dequeues in while loops with the help of SQLQueryStresser.

First, the tables:

DROP TABLE IF EXISTS dbo.testq;
DROP TABLE IF EXISTS dbo.qtrack;

CREATE TABLE dbo.testq
(
    ID INT IDENTITY PRIMARY KEY,
    inserttime DATETIME2(2) CONSTRAINT DF_qq2 DEFAULT GETUTCDATE(),
    filler CHAR(400)
)

CREATE TABLE dbo.qtrack --watermark table
(
    ID INT PRIMARY KEY
)

INSERT dbo.qtrack
VALUES (0)

INSERT dbo.testq (filler)
SELECT TOP (30000) 
    REPLICATE(CONVERT(VARCHAR(36), NEWID()), 11) --filler
FROM sys.all_columns a,
     sys.all_columns b;
GO

Here’s the normal style of dequeue:

CREATE OR ALTER PROC dbo.dequeue_regular
AS
BEGIN
    DECLARE @d TABLE
    (
        junk VARCHAR(500)
    );
    WITH CTE
    AS (SELECT TOP (1)
               *
        FROM dbo.testq WITH (ROWLOCK, READPAST)
        ORDER BY ID)
    DELETE CTE
    OUTPUT Deleted.filler
    INTO @d;
END;

And the dequeue using the watermark table:

CREATE OR ALTER PROC dbo.dequeue_watermark
AS
BEGIN

    DECLARE @d TABLE
    (
        ID INT,
        junk VARCHAR(500)
    );

    DECLARE @w INT = (SELECT TOP 1 ID FROM dbo.qtrack)

    WITH CTE
    AS (SELECT TOP (1)
               *
        FROM dbo.testq WITH (ROWLOCK, READPAST)
        WHERE ID >= @w
        ORDER BY ID)
    DELETE CTE
    OUTPUT Deleted.ID,
           Deleted.filler
    INTO @d;

    DECLARE @t INT = (SELECT TOP 1 ID FROM @d)

    IF @t % 100 = 0
    BEGIN
        UPDATE dbo.qtrack
        SET ID = @t - 50;
    END;
END;

There are several things to note about the way I designed this. Updating the watermark in every concurrent dequeue would be a Bad Idea, so I added the %100 so that roughly every hundredth dequeue moves the watermark. Also, I didn't simply set the watermark to the processed value, but to some buffer amount lower. To be super safe, we also set up a job every five minutes to rebuild the table, correct the watermark, and log any issues.
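For reference, the safety job was roughly this shape (a sketch – the rebuild purges ghosts, and the update snaps the watermark back to the smallest surviving ID):

```sql
ALTER INDEX ALL ON dbo.testq REBUILD; -- ONLINE = ON if your edition allows

UPDATE dbo.qtrack
SET ID = ISNULL((SELECT MIN(ID) FROM dbo.testq), 0);
```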

And that’s it, a method to protect yourself from ghosts using SQL. It’s been working like a charm for us, and I hope it works for those of you who need it. Oh, and please let me know if you encounter data Bigfoot – I’d like to build a proc for him too.

A Doodled Intro to the Storage Engine

Paul Randal is a SQL Server legend with loads of informative articles. But when I was a baby DBA first reading Inside the Storage Engine, I got a little stuck. It took many passes before, eventually, finally, it clicked. I wish I'd had a lightweight introduction, so in the practice of paying it forward…

Here’s the starting point: sometimes it’s easier to manage lots of small things (say, the 1s and 0s of data) by grouping them into larger things. It’s the same reason you don’t buy rice by the grain.

For SQL Server, data is logically grouped into 8KB Pages, which are themselves grouped into Extents (an Extent being 8 contiguous pages).

Within a file, you need to know where the data is, and that’s why there are mapping pages!

SQL Server uses a combination of maps to describe page usage. The GAM page shows which extents are available (thus it’s 0 for allocated, 1 for open/unused). The SGAM exists because pages in an extent might be used for multiple tables (a mixed extent), or reserved for a single table. The SGAM marks which mixed extents can have a page allocated.

Most mapping pages describe the space within the file, and are repeated at fixed locations for the area they cover. The GAM, for example, is a single 8KB page whose roughly 64,000 bits each track one 64KB extent – about 4GB of coverage – so another GAM page appears every 4GB.

Mapping pages for tables, called IAMs, need different treatment however. Think about it – Microsoft doesn’t know ahead of time what tables you’ll have, and can’t have IAMs in fixed locations. So the IAMs contain info about what they cover and where the next IAM is, which allows them to be placed anywhere in the file.
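If you want to poke at an IAM chain yourself, there's an undocumented DMF for it (dbo.SomeTable below is a placeholder for any table you have handy):

```sql
-- Lists every page allocated to the table; IAM pages are flagged
SELECT allocated_page_file_id, allocated_page_page_id,
       page_type_desc, is_iam_page, is_allocated
FROM sys.dm_db_database_page_allocations(
    DB_ID(), OBJECT_ID('dbo.SomeTable'), NULL, NULL, 'DETAILED');
```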

And there’s still more to learn, like those pesky PFS pages! I’m not going to write about them though; Paul already did. So go! Continue your learning journey and profit from your hard-earned knowledge!

Finding a Spinlock Owner in a Dump

Oh. Lovely…

Well, this SQL Server has dumps. At least it stays regular. These happen to all be of the “Non-Yielding Scheduler” variety. Now for the fun task of dump diving.

The Journey Begins

Windbg makes starting relatively easy…load the dump:

Click the blue text:

And wait a long time for windbg to load symbol files. (You don’t actually have to wait if you’re just reading a blog. Kinda nice, huh?) When it finally finishes, behold, the offending stack of a dump:

Now, I may not be an expert, but sqlmin!Spinlock sounds like…a spinlock. This thread has been spinning for over a minute, never returning to a waiting state, because something else is holding the spinlock resource.

Thankfully, helpful friends alerted me to a blog that revealed the value of an acquired spinlock “is the Windows thread ID of the owner.” Meaning I might be able to find the cause.

A Different Approach

I already had a suspect thread – it’s the only other one active when I look at all of SQL Server’s threads.

Actually digging through hundreds of threads is annoying of course, and there’s a slightly faster way: !uniqstack, which removes duplicates. (Thanks to Bob Ward for sharing this trick in one of his presentations.)

If you run that, you’ll see a multitude of stacks with something like this at the top.

Those NtSignalAndWait calls are how SQL Server makes a worker wait. Those workers weren’t doing anything at the time the dump was taken.

However, there’s another thread that isn’t waiting:

I bet this one is holding the spinlock, but I have to prove it.

Obstacles

I need to find the memory address of the spinlock owner, and confirm which part of this info is the Windows thread ID. Let's start by looking at the spinlocked thread.

Thankfully, the context is already 645, but if we needed to change it, we could use ~645s. Note that 645 is NOT the Windows thread id. Instead, we can look at the part that says Id: 1d60.3914. 1d60 is the process (sqlservr.exe) id, and 3914 is the thread id.

Let’s pull up the registers saved for the thread.

There are a load of them, but we want rip. The rip register holds the address of the instruction this thread was executing when the dump was taken, which should be inside the spinlock code – and the spinlock code will show us where the lock's memory address lives.

Cool. Now we can copy that address into the disassembler.

Found it! The critical part of a spinlock has to be an atomic compare and swap – the lock cmpxchg. In this case it’s looking at the memory address in rdi and replacing the value with rsi (unless the lock is already held, which it is). Looking at our registers, rsi is indeed the Windows thread id, 3914. Now we just look at the memory address held in rdi.

d displays a memory address, dq means look at quad words, and L1 means show just one

And there we see the Windows thread id of the spinlock owner, 1280!

We can switch to that thread with ~~[1280]s…yup, it’s the same one I found before, the one that looks like it’s doing Redo operations. Victory!

Of course, when I bothered to actually read the documentation on cmpxchg, I learned that if a different value is in the target address (i.e., the spinlock is held by another thread), that value is placed in the eax register. Meaning I could have bypassed the whole exercise with r eax.

Concussion

Nope, not a typo. The appropriate ending to a windbg post is indeed head trauma. Hope you enjoyed it.

Reasoning about Indexed View Locking

Edit: Paul has pointed out some significant gaps, so I will be reworking this post in the future. Locks are accurate for this particular scenario, but apparently there are other behaviors to consider.

Sometimes I hit a weird query where I wonder if, just maybe, an indexed view would help. Then a tingling sensation crawls down my spine and I run screaming in terror, because indexed views have a reputation. Ghastly deadlocks, phantom issues, and sinister edge cases haunt my DBA sensibilities, scaring me away.

But I prefer understanding – I prefer science. So I investigated and experimented, poked and prodded. Great thanks to Paul White and Erik Darling. Erik somewhere spelled out the key idea: one/many-to-many joins in the view cause serializable locks.

My setup script (full script at the bottom) creates two identical tables A and B of even numbers 2 to 10, with an indexed view AB that joins on A>B in a one-to-many relationship.

CREATE OR ALTER VIEW dbo.AB
WITH SCHEMABINDING
AS
SELECT A.ID AS AID, COUNT_BIG(*) AS bigcount
FROM dbo.A
JOIN dbo.B
	ON B.ID < A.ID
GROUP BY A.ID

A modification to a table here involves three objects, each with associated locks: the modified table, the joined table, and the view. First, the locks for the original modification are taken; then locks on the joined table, as the engine figures out which rows of the view are affected; and finally, locks on the rows of the view.

But that was hard to keep all in my head, so I drew it out. Here are the objects (keys only).

A is the “one” side of the one-to-many, so a modification there involves “normal” locks. But a modification to B, since it can be referenced by multiple rows of A, will involve the super special magical serializable locks.

Let’s look at an insert:
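To reproduce the pictures with the setup script at the bottom: insert inside an open transaction, then peek at the locks from the same session.

```sql
BEGIN TRAN;

-- B is the "many" side; 5 is a new value (the tables hold evens 2 to 10)
INSERT dbo.B VALUES (5, 'data');

-- Expect key locks on B plus range locks on the view AB
SELECT resource_type, request_mode, request_status, resource_description
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID;

ROLLBACK;
```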

The row inserted into B has an X lock. Next, it will need to search through A to find matching rows of AB, meaning transient S locks in A.

The S locks are released after rows are read, but when the engine arrives at AB, we see range locks (note that the S and range locks can actually be interleaved; I separate them for simplicity).

But why range locks? Because the many-ness of the relationship means there are potentially conflicting inserts that wouldn’t be covered by just key locks.

Yikes, no wonder these produce deadlocks! Also, the engine takes more range locks on the view than it strictly needs – probably to avoid complicated programming around edge cases.

Maybe it’s just how my brain works, but drawing these out is what finally made the locking strategy click for me. I don’t consider indexed view locking as mysterious anymore (yay science!), but I do find it monstrous. Be warned!

--Setup
CREATE TABLE A (ID INT PRIMARY KEY, junk CHAR(10))
CREATE TABLE B (ID INT PRIMARY KEY, junk CHAR(10))

INSERT A
SELECT TOP (5) 
(ROW_NUMBER() OVER(ORDER BY 1/0))*2,
'data'
FROM sys.columns

INSERT B
SELECT TOP (5) 
(ROW_NUMBER() OVER(ORDER BY 1/0))*2,
'data'
FROM sys.columns
GO

CREATE OR ALTER VIEW dbo.AB
WITH SCHEMABINDING
AS
SELECT A.ID AS AID, COUNT_BIG(*) AS bigcount
FROM dbo.A
JOIN dbo.B
	ON B.ID < A.ID
GROUP BY A.ID
GO

CREATE UNIQUE CLUSTERED INDEX CX_AIDs ON dbo.AB (AID)
GO

Reverse Engineering the Key Lookup Cost Formula

During a recent investigation into Key Lookup costing, I discovered something interesting. Perhaps a warning is due: I’m a nerd, and things I find “interesting” may cure insomnia in others. But, dear reader, if you’re taking time to read a database blog, you just might be a nerd as well.

Examining a key lookup, they appear to have a cost of 0.0001581 CPU and 0.003125 IO. However, running the math over several experiments showed that the per lookup cost decreased with the number of lookups. But how much? What’s the formula?

Gathering Data

I didn’t know if I would need a complicated curve fitting approach, but I did know the more data points the better. I also knew that working through hundreds of queries and manually recording query buck costs would suck and I didn’t want to do it. But how could I automate pulling costs from execution plans of hundreds of queries? Oh. Right. It’s PowerShell time isn’t it?

Step 1 of course had to be SQL setup. I used a thousand-page table with one row per page.

CREATE TABLE Testable (
ID1 INT IDENTITY PRIMARY KEY,
ID2 INT,
filler CHAR(8000),
INDEX IX_testable_ptooie (ID2)
)

INSERT Testable
SELECT TOP(1000)
ROW_NUMBER() OVER(ORDER BY 1/0),
'junk'
FROM sys.columns

Step 2 was getting the estimated execution plan using PowerShell. Step 3 was getting the PowerShell to actually work. Step 4 was using XPath inside a PowerShell script, so you can guess what steps 5-20 were: debugging. Blech…XML schemas. Special thanks to Rob Farley and Jonathan Kehayias for sharing useful code.

function Get-SqlPlan([string] $query){
return ([xml](invoke-sqlcmd -MaxCharLength 9999999 -Server "." -Database "TestDB" -Query ";set showplan_xml on;`ngo`n $query;`ngo").Item(0));
}

$results = @{}

for ($i = 1; $i -le 1000; $i++) {

$query = "SELECT *
FROM Testable WITH(INDEX(IX_testable_ptooie))
WHERE ID2 <= $i
OPTION(MAXDOP 1)"

$t = Get-SqlPlan($query)
$ns = new-object 'System.Xml.XmlNamespaceManager' $t.NameTable;		
$ns.AddNamespace("ns", "http://schemas.microsoft.com/sqlserver/2004/07/showplan");
$results.add($i,$t.SelectNodes('//ns:RelOp[@LogicalOp="Clustered Index Seek"]',$ns).EstimatedTotalSubtreeCost)
}

$results.GetEnumerator()| export-csv -NoTypeInformation -Path "C:\Temp\Key Test.csv" 

Reverse Engineering

I was left with a CSV full of costing info for a thousand key lookups.

Time for some Excel analysis. (Look Ma, I'm a Data Scientist!) First, I calculated the additional cost of each key lookup. Quick tests confirmed that the additional CPU cost was constant and only the IO cost decreased. This makes sense: as key lookups continue, there's a growing chance that the needed page was already pulled into the buffer pool.

Yup, weird results for the second through fourth lookups

The decrease obviously wasn’t linear, so I decided to try an approach from statistics. There’s a class of related problems including buckets and balls, and matching birthdays. In the case of a thousand-page table, the second lookup should occur on a random page with a 999/1000 chance to be different from the first lookup’s page. The third lookup should have 999/1000 chance to be different from the first, and 999/1000 chance to be different from the second, so (999/1000)^2. Generalizing, I’d expect the Nth lookup to have a (999/1000)^(N-1) chance of needing a new page (and thus an IO).

Let’s plug in the formula and see…

Almost…but for some reason, the empirical exponent turned out to be N-2 instead of N-1. I can’t come up with why, but whenever I see weird exponents in statistics the explanation is something something *handwaving* degrees of freedom.

There are still two other quirks. The first is that lookups 2, 3, and 4 don’t follow the pattern – not even close. I have no idea why this is the case. The other quirk shows up when looking at formula versus empirical lookup costs for high lookups.

Marginal IO costs only approximately follow the formula here, with a lot of repeated values. My guess is that doing a heap of math isn’t very efficient when SQL Server is trying to quickly determine a plan, so it uses saved values, approximations, or other tricks.

Math

Anyways, here's what I learned about Key Lookup costs: the CPU component is always 0.0001581 per lookup. IO costs are capped at 0.003125 times the number of pages in the table. The extra IO cost of the Nth lookup is approximately (see prior caveats) 0.003125 × ((PageCount−1)/PageCount)^(N−2). Or, using my fancy math font program:
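Written out (my reconstruction from the measured data, not anything documented), with P the page count and N the lookup number:

```latex
\mathrm{CPU}_N = 0.0001581,
\qquad
\mathrm{IO}_N \approx 0.003125 \left(\frac{P-1}{P}\right)^{N-2},
\qquad
\sum_N \mathrm{IO}_N \le 0.003125 \, P
```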

fancy program ran out of room to write the part about max page costs…

So there you have it, something about SQL Server with no practical value. Can you believe I spent vacation time doing this? Actually, maybe I can troll some MS folks by saying they got their estimation math wrong. I do believe the people who wrote the estimation (and who are much smarter than I am) had some really good reasons, but hey, why let that get in the way of inflammatory tweeting?

Parse Time vs Compilation Time

I’ve been messing around with long compilations recently, and a question was lingering in the back of my mind, based on SET STATISTICS TIME ON…

Is there a difference between parse time and compile time? This led me to science time, aka beating up SQL Server and taking its secrets.

First, let’s build a proc that contains a very simple query but a gigabyte of whitespace. Errr, whoops…

Fine, just half a gig of whitespace:

DECLARE @w VARCHAR(MAX) = '        ' --8 spaces

SET @w += @w --2^4
SET @w += @w --2^5
SET @w += @w --2^6
SET @w += @w --2^7
SET @w += @w --2^8
SET @w += @w --2^9
SET @w += @w --2^10 = 1024, 1KB
SET @w += @w --2^11
SET @w += @w --2^12
SET @w += @w --2^13
SET @w += @w --2^14
SET @w += @w --2^15
SET @w += @w --2^16
SET @w += @w --2^17
SET @w += @w --2^18
SET @w += @w --2^19
SET @w += @w --2^20 = 1MB
SET @w += @w --2^21
SET @w += @w --2^22
SET @w += @w --2^23
SET @w += @w --2^24 
SET @w += @w --2^25
SET @w += @w --2^26
SET @w += @w --2^27
SET @w += @w --2^28 
SET @w += @w --2^29 = 512MB
--SET @w += @w --2^30 = 1GB


DECLARE @sql VARCHAR(MAX) = '
CREATE OR ALTER PROC Parsimonious
AS
SELECT TOP(1)'+@w+' *
FROM dbo.Comments
OPTION(RECOMPILE)
'

EXEC(@sql)

This is enough to prove parse time <> compile time, as they show up with different numbers. STATISTICS TIME captures the full duration.

The execution plan, meanwhile, shows a smaller compile time. (Also, the Extended Events that report compile time matched the execution plan's number.)

sys.query_store_query contains columns related to this, with more than just the execution plan compilation number, but those don’t show the full time either.
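For reference, here's a sketch of the columns I mean. The durations in sys.query_store_query are averages reported in microseconds, and the bind/optimize breakdown goes beyond the single number in the execution plan:

```sql
-- Compilation-related columns in Query Store (averages, in microseconds)
SELECT TOP (20)
    q.query_id,
    q.count_compiles,
    q.avg_compile_duration,   -- overall compile, per Query Store
    q.avg_bind_duration,      -- binding portion
    q.avg_optimize_duration   -- optimization portion
FROM sys.query_store_query AS q
ORDER BY q.avg_compile_duration DESC;
```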

In summary, parsing is different from compilation, and consumes CPU that doesn’t show up anywhere except STATISTICS TIME. This would make a server spending all its compute on parsing harder to detect, though I imagine it’s rare. I can’t say it never happens though, because I’ve seen things in production. Seen things.

How to Measure Compilation Impact

Across the many instances I’ve seen, it’s a rather rare scenario for sustained compiles/recompiles to contribute substantially to CPU load. The path from clues to conviction was a long one, so buckle up if you want to follow along – this is a lengthier-than-average post.

First Signs of a Compilation Problem

I’ve found the real-world issue twice, and it’s been two different sources of information that convinced me to look at compilations. The easiest, of course, is the compilations counter. Most monitoring tools will track it for you, but in case you want to check for yourself, here’s a sampling script:

DECLARE @comp TABLE (counter_name varchar(100), cntr_value bigint)

INSERT @comp
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE '%compila%'

WAITFOR DELAY '00:00:30'

SELECT pc.counter_name, (pc.cntr_value-c.cntr_value)/30 AS val_per_sec
FROM sys.dm_os_performance_counters pc
JOIN @comp c ON c.counter_name = pc.counter_name
WHERE pc.counter_name LIKE '%compila%'

Another type of clue actually comes from absence. My experience here is from an Azure SQL DB with high CPU but only moderate compilations. However, the top CPU consuming queries were only accounting for a fraction of observed compute. Let me give an example – suppose you have a 4 vCore SQL DB running at around 80% CPU utilization. You’d expect that to be 80% * (available core seconds) = .8*4*3600 = 11520 seconds of CPU time in the last hour. You can check Query Store to get a sense of how much execution compute there is.

You can also query the query store tables directly to find total CPU, but this is left as an exercise for the reader.
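If you'd rather skip the exercise, here's a rough sketch of what I mean. avg_cpu_time is in microseconds, and interval boundaries make the hour cutoff approximate:

```sql
-- Approximate CPU seconds recorded by Query Store over the last hour
SELECT SUM(rs.avg_cpu_time * rs.count_executions) / 1000000.0 AS cpu_seconds
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_runtime_stats_interval AS i
    ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
WHERE i.start_time >= DATEADD(HOUR, -1, SYSUTCDATETIME());
```

Compare that number against the expected core-seconds from the utilization math above; a large gap is the clue.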

Now, the CPU gap could be coming from several sources missed in a cursory analysis, such as single-use plans that Query Store fails to aggregate. But if you’re only finding a fraction of the CPU you expect, there’s something worth investigating.

Of course, I would be remiss in ignoring the plan cache as corroborating evidence. Huge numbers of single-use plans (or duplicate query hashes) are the thing to look for, but it’s not definitive. Some joker may have decided the best approach to parameter sniffing is to plaster OPTION(RECOMPILE) on everything and keep it from being saved in the cache. Yes I’m bitter.

Proving It

Alright – there’s smoke, but is there fire? This is where it gets hard. Here were my real-world scenarios: a 2014 physical server with 5-10 thousand compilations/sec (!) and a 6 vCore SQL DB with a huge amount of missing CPU in Query Store but only(?) about 50 compiles/sec.

Dead End: Query Store

Query Store actually includes some info about compilation in sys.query_store_query. Of course, this wasn’t even an option for the archaic 2014 instance, but on Azure, I tried checking the most recently executed queries and adding up their compilation impact. Unfortunately, this didn’t work – it was orders of magnitude too low. It might have been partly due to AUTO vs ALL capture mode, but either way it was still a little bit hacky, and the table clearly wasn’t designed for this.

Also, in the case of my Azure SQL DB, the heavy compilation was regularly knocking Query Store into READ ONLY mode, so when I went to find out the problem, there was no recent history. Oops.

XE Struggles

The answer will have to be found with Extended Events, but this is far from trivial. In 2014, there is a *single* event that can capture all compilation CPU, and it’s a chunky one – I mean, we’re talking morbidly obese, mounds of blubber draped over a WalMart scooter chunky: query_post_compilation_showplan.

It’s even one of a handful that MS stapled a warning to: “significant performance overhead.” And of course there’s no way to exclude the query plan from the output, so with each event you could be dealing with huge XML (especially for those complex queries taking a while to compile).

Oh, and the XE description is wrong as well – cpu_time is in milliseconds, not micro.

derp

Azure Efforts

Still, it was the only option available. I went ahead and tried it for the Azure DB since compilations were lower. I didn’t want to wait weeks for approval to set up a storage account, so I just dumped the output into the ring buffer. Oh look, there’s no way to easily view XE output on an Azure SQL DB…

I didn’t want to leave a high-impact XE running, so I just saved the XML into a file, then used PowerShell to parse and aggregate it. It…worked. Mostly. And was painful. Don’t do it this way. It also turned out that the plans were so large that the ring buffer could only hold a couple seconds worth at a time. The data was still enough to prove a very high compilation impact, and let me send likely culprits to a developer buddy.

XE On-Premise

Given my lukewarm success in Azure, I decided to try a similar approach for my on-premise 2014 server, except I would be able to view live data for the XE, so maybe I could push it out to 10+ seconds for a better sample?

I waited until a low activity time, with “just” a couple thousand compiles a second, fired up the XE, and…

SSMS crashed. I tried an even smaller dataset and other tricks, but SSMS just couldn’t handle the sample size needed. I had to find a new approach.

My nemesis was the enormous query plan…if only I could ignore it. My “Aha!” moment came when I realized I could.

The histogram target bucketizes output by a single attribute, so my cheat was to aggregate on cpu_time for a thirty second sample, then math the total CPU spent compiling. For example, if there were 2 queries that took 10s to compile, and 10000 that took 20ms, this would mean that 2*10+10000*.02 = 220s were spent compiling, or 7.3 cores on average in a 30s sample. Here’s the script I came up with. (Special thanks to Kendra Little for the TSQL to query a histogram)

--CREATE EVENT SESSION [Comp_hist] ON SERVER
--ADD EVENT sqlserver.query_post_compilation_showplan
--ADD TARGET package0.histogram(SET filtering_event_name=N'sqlserver.query_post_compilation_showplan',slots=(1024),source=N'cpu_time',source_type=(0))
--WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_MULTIPLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
--GO

SET XACT_ABORT ON

IF OBJECT_ID('tempdb..#hist') IS NOT NULL DROP TABLE #hist
--DROP TABLE IF EXISTS #hist

ALTER EVENT SESSION [Comp_hist] ON SERVER  
STATE = start;  
GO  

DECLARE @start DATETIME2(1) = GETDATE()

WAITFOR DELAY '00:00:30'

--thanks kendra!
SELECT
    xed.slot_data.value('(value)[1]', 'int') AS cpu_ms,
    xed.slot_data.value('(@count)[1]', 'int') AS slotcount
INTO #hist
FROM (
    SELECT
        CAST(xet.target_data AS xml)  as target_data
    FROM sys.dm_xe_session_targets AS xet  
    JOIN sys.dm_xe_sessions AS xe  
       ON (xe.address = xet.event_session_address)  
    WHERE xe.name = 'Comp_hist'
        and target_name='histogram'
    ) as t
CROSS APPLY t.target_data.nodes('//HistogramTarget/Slot') AS xed (slot_data);

DECLARE @end DATETIME2(1) = GETDATE()

ALTER EVENT SESSION [Comp_hist] ON SERVER  
STATE = stop;

SELECT SUM(cpu_ms*slotcount)*1.0/DATEDIFF(millisecond,@start,@end) AS [Avg cores compiling]
FROM #hist

The script was an utter success – not only was the overhead minimal, it captured the data I needed to prove that our largest, priciest server was spending 40% of its compute on plan generation.

Happy Ending

My developer contact really ran (sprinted?) with the info from the Azure DB, and fixed the issues behind the biggest compilation offenders (from Entity Framework). Average CPU dropped from 70% to 20%, such that we can scale down and save money. For the on-premise server, the 40% number was enough to convince developers to don their hazmat suits and make changes to the problematic legacy code. I haven’t heard anything back yet, which might be because the legacy code ate them – it’s that bad.

Better Solution?

I’m still not content with my current approach. For one, Azure SQL DB doesn’t allow histogram targets, so it just doesn’t work there. Also, the XE I use is normally high impact, so I’m still wary of using it in a histogram. After some digging, I found a Debug XE (remember, Debug is the type you have to specifically check in SSMS) that looks promising: sql_statement_post_compile. I’m not sure when it was added, but I see it for Azure and 2019, but not 2014. Next time I’m checking Azure, I’ll test that one out instead.
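Here's an untested sketch of what the lighter session might look like, swapping the new event into the same histogram pattern as before. I'm assuming sql_statement_post_compile exposes a cpu_time field like its chunky cousin does – verify the field name and units on your build before trusting any numbers from this:

```sql
-- Hypothetical lighter-weight version of the histogram session (2019+ / Azure where allowed)
CREATE EVENT SESSION [Comp_hist_light] ON SERVER
ADD EVENT sqlserver.sql_statement_post_compile
ADD TARGET package0.histogram (
    SET filtering_event_name = N'sqlserver.sql_statement_post_compile',
        slots = (1024),
        source = N'cpu_time',
        source_type = (0)   -- 0 = event field rather than an action
)
WITH (EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS, MAX_DISPATCH_LATENCY = 30 SECONDS);
```

The aggregation query from the earlier script should work unchanged against the new session name.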

So in summary, use sql_statement_post_compile if you have access to it, use a histogram if you’re not Azure SQL DB, and don’t let EF generate giant IN lists of hardcoded values that take 300ms to compile. Ick, why did I end with something about EF? Happy thoughts. Happy thoughts! WHY EF WHYYYYYYY????!!!!!!