Missing Index Hints and the Abominable Rebuild

“It’s a very short blocking phase” they said. “Metadata only” they said.

Meanwhile, here’s my online rebuild of a zero row table.

What database doing?

Yes, it’s been running for 30 seconds, blocking the whole time. For extra fun, it’s unkillable.

Findings first: SQL Server stores details about missing index suggestions. The DMV pertinent to screwing up index operations is sys.dm_db_missing_index_group_stats_query. Should this DMV accumulate a very large number of entries, any index create/rebuild on any table with a missing index hint will take extra time to process those entries. Yes, that’s any table anywhere – it doesn’t matter if it’s in a separate database with a single hinted query.
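
If you want to gauge the bloat yourself, a quick count is enough. A minimal sketch – this DMV only exists on SQL Server 2019 and later, and as described below, even the count gets slow once the DMV is huge:

SELECT COUNT(*) AS hint_entries
FROM sys.dm_db_missing_index_group_stats_query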

Troubleshooting

The prod version of this issue manifested as 1-3 minutes of SCH-M blocking for index ops, yet only on a single server; I couldn’t replicate it anywhere else. Of course, bad blocking meant nasty timeouts, and that in turn meant I was reluctant to run live diagnostics. Better to just stop index rebuilds. One does not take down production in order to test weird SQL behavior.

An important clue came from the rebuild of a tiny table, visible in the whoisactive screenshot above. Why would such a small rebuild need so much CPU? That worker was definitely doing something – the question was what.

My lucky break came with a bad proc that needed emergency index help – in this case the fix was worth the blocking, and the table was non-critical anyways. I readied the index creation script, and prepared an extended event to capture callstacks on waits and spinlock backoffs for my spid.

And…jackpot. The worker was hitting SOS_SCHEDULER_YIELD waits, and for 3 minutes every callstack was topped with:

sqllang!CQSIndexStatsMgr::DropIndexStatsWithWildcards
sqllang!CQSIndexRecMgr::InvalidateMissingIndexes
sqllang!QPEvent::CreateIndex
sqllang!CIndexDef::CreateIndex

Missing indexes huh? I started poking around the DMVs, and discovered that sys.dm_db_missing_index_group_stats_query took several minutes to query – it was even unkillable, just like the index rebuilds.

Repro

I finally had enough clues to cause the issue on purpose. I’m going to show you how to do this at home, because breaking SQL Server is fun…in a test environment.

First, let’s create a table. No rows necessary, just tell SQL Server that there are.

drop table if exists dbo.RebuildTest

create table dbo.RebuildTest (
	ID int identity primary key,
	Num int,
	Filler char(100)
)

UPDATE STATISTICS dbo.RebuildTest WITH ROWCOUNT = 1000000, PAGECOUNT = 10000

Now we need to generate missing index hints….lots of them. After some experimenting, I combined two features known to play nicely with caching: temp tables and dynamic SQL. (Dear junior DBA reading this: the prior sentence is sarcasm. Those features are notorious for caching issues.)

SET NOCOUNT ON

DECLARE @sql NVARCHAR(4000)
DECLARE @stop DATETIME = DATEADD(MINUTE,60,GETDATE())

WHILE GETDATE() < @stop
begin
	
	SET @sql =
	'select *
	into #t'+CONVERT(CHAR(36),CONVERT(BINARY(16),NEWID()),1)+'
	from dbo.RebuildTest
	where Num = (select -1)'

	EXEC(@sql)

end

I ran that from ten sessions for a really long time – I think I accumulated about 15 core hours of spam. Rinse and repeat as needed to get the desired slowdown on your own system.

Consequences of the above pollution:

Over nine million!

Let’s create an unrelated table, also zero rows.

drop table if exists dbo.OtherTable

create table dbo.OtherTable (
	ID int identity primary key,
	Num int,
	Filler char(100)
)

UPDATE STATISTICS dbo.OtherTable WITH ROWCOUNT = 1000000, PAGECOUNT = 10000

If you rebuild now, it will be super fast.

Run a single query that generates a missing index hint however…

select *
from dbo.OtherTable
where Num = (select 42)

And your rebuild transforms into a lumbering unkillable monstrosity

ALTER INDEX ALL ON dbo.OtherTable REBUILD WITH(ONLINE=ON,MAXDOP=1)

--need to tell SQL fake stats again if you want to repeat the tests
UPDATE STATISTICS dbo.OtherTable WITH ROWCOUNT = 1000000, PAGECOUNT = 10000

Resolution

Ironically enough, this can be fixed by, ugh, restarting SQL Server, adding the appropriate index, or rebuilding the table behind the missing index spam. Due to the nature of the pollution, none of the SQL handles from the DMV resolved to text, but I was able to work up a way to see which objects were behind the growth.

SELECT TOP 10 mid.database_id, mid.object_id, COUNT(*) c
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig
ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats_query migs
ON migs.group_handle = mig.index_group_handle
GROUP BY mid.database_id,
         mid.object_id
ORDER BY c desc

There’s another detail I haven’t figured out: for my production system, the DMV had 3 million entries and was adding minutes of extra time. For my laptop, 9 million entries added about 45 seconds. I’m kinda curious where the discrepancy comes from, but not enough that I’m willing to spend hours melting my personal machine or digging through disassembly. Besides, disassembly problems belong to Paul White, the magical SQL hermit on a remote island who solves weird code issues for booze, and whom I secretly aspire to be if I grow up.

It’s probably a rare issue to hit as well, needing a server that hasn’t been restarted in a long enough time and unique queries that both generate a hint and don’t get combined with similar SQL. I mean, how ridiculous would it be for people to dynamically fill global temp tables named with guids? Nobody does that, right? Crazytown.

But this is the way of things – do weird stuff in SQL, get weird issues. At least weird issues lead to entertaining blog posts, which I sincerely hope does not incentivize you to keep doing the weird stuff. Especially Service Borker. Seriously, knock that out.

Thoughts on RLS for Data Migration

The common corporate amoeba floats around, grasping at revenue, customers, and other nutrients with ciliated pseudopods. You, dear gainfully-employed DBA, are an organelle in this subsentient blob. Sometimes your employer-host catches another amoeba, and it becomes your duty to ingest their delicious data.

This was the situation I found myself in earlier this year – our company had absorbed another, and it was time to slurp up their tables. There were a lot of decisions to make and tradeoffs to weigh, and we ended up choosing to trickle-insert their data, but make it invisible to normal use until the moment of cutover.

The way we implemented this was with Row Level Security. Using an appropriate predicate, we could make sure ETL processes only saw migrated data, apps saw unmigrated data, and admins saw everything. To give a spoiler: it worked, but there were issues.

Let’s consider a simplistic RLS filter, with the StackOverflow2010 database for examples.

CREATE FUNCTION dbo.RLSfunctionSimple(@Migrated AS bit)
    RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN 
	SELECT 1 AS Pass
    WHERE (@Migrated = 1 AND USER_NAME() IN ('ETLProcess','BubbaTheAdmin'))
	OR (@Migrated = 0 AND USER_NAME() NOT IN ('ETLProcess'))
GO

Great! This works. We add an IsMigrated column to the Users table, set the security policy, and bam, we’re done. Yeah no.

ALTER TABLE dbo.Users 
ADD IsMigrated BIT NOT NULL DEFAULT(0)

CREATE SECURITY POLICY RLS_Users
ADD FILTER PREDICATE dbo.RLSfunctionSimple(IsMigrated) 
ON dbo.Users
WITH (STATE = ON);

More people than just Bubba want to see migrated data. A migration is a big deal, and everyone, everyone, is going to ask to see migrated data for reasons that are honestly quite legitimate. You will constantly be modifying who gets access.

There’s another issue: adding RLS to a table takes a SCH-M lock on that table. Have you experienced the joys of frequent SCH-M locks on critical tables in a highly active database?

So out of a desire to easily make access changes and avoid SCH-M locks (and because you can’t alter the function while it’s used by the security policy), you decide to make the RLS function reference tables. Let’s add one table for the ETL processes that should only see migrated rows, and one table for the folks who want to read everything. Everyone else by default only views unmigrated data.

CREATE TABLE dbo.SeeEverything (
UserName NVARCHAR(256) PRIMARY KEY
)
CREATE TABLE dbo.SeeMigrated (
UserName NVARCHAR(256) PRIMARY KEY
)

And here’s what the RLS function might look like:

CREATE FUNCTION dbo.RLSfunctionComplex(@Migrated AS bit)
    RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN 
	SELECT 1 AS Pass
    WHERE (@Migrated = 1 
		AND EXISTS (SELECT 1 FROM dbo.SeeMigrated WHERE UserName = USER_NAME())
		OR EXISTS (SELECT 1 FROM dbo.SeeEverything WHERE UserName = USER_NAME())
		)
	OR (@Migrated = 0 
		AND NOT EXISTS (SELECT 1 FROM dbo.SeeMigrated WHERE UserName = USER_NAME())
		)

And now this

SELECT DisplayName
FROM dbo.Users
WHERE ID = 12345

Becomes this

RLS that references other tables will break queries. Sometimes the added complexity just made the optimizer fall over; other times useful query transformations broke.

Note that the above query plan has nine extra nodes. Imagine a relatively complex query, where most of the involved tables have RLS. Let me tell you, things break. You might say, oh don’t worry, we’ll only add to core tables. No, that’s not how it will work. Bad data is expensive. The business, at least during a migration, cares about data quality. If you have RLS functioning fine on N tables, there will be pressure to make it N+1, just to be safe.

There’s a specific and common transformation that breaks when RLS reads other tables, the TOP-MAX transformation.

Here’s the normal version, at 0ms.

SELECT MAX(ID)
FROM dbo.Users

And here it is after RLS, scanning your tables and taking 116ms

And this is only part of the plan

Any place you rely on the TOP-MAX transformation, you will have to rewrite as an explicit TOP 1.
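
For the MAX(ID) example above, that rewrite is roughly the descending scan spelled out by hand (note the small semantic difference: an empty table returns no rows instead of NULL):

SELECT TOP (1) ID
FROM dbo.Users
ORDER BY ID DESC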

Recommendations

Try as hard as you reasonably can to avoid external references in your RLS function. We still ended up with one – we landed on a single look-up-everything table, and it didn’t break as much as other multi-table functions did.

Expect your coverage to grow beyond initial estimates, whether users who need access or tables that need protection.

You can hot-swap your RLS function with ALTER SECURITY POLICY. It still takes a SCH-M lock on the affected table, but it’s a lot better than dropping your security policy in order to alter the function.
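
Here’s what the hot-swap looks like with the objects from this post – a minimal sketch, assuming you’re swapping the simple function out for the complex one:

ALTER SECURITY POLICY RLS_Users
ALTER FILTER PREDICATE dbo.RLSfunctionComplex(IsMigrated)
ON dbo.Users;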

All-in-all, I’m mostly ok with our decision. It meant a lot of work ahead of time (including adding the migration column to tables – don’t forget to include it in the indexes as well). But when the migration day came, we turned RLS off, and everything just worked. Given typical cutover craziness, that’s not a bad tradeoff to make.

A Distinct (Query) Murder Mystery

A #sqlhelp member runs through the Slack channel, “Help! Help! My insert has been killed!” Bystanders gather around the still-warm body to stare at the soulless Consolas note fixed to its chest.

Rough way to go, that.

Clamor erupts immediately. Accusations fly. Anyone remotely resembling a butler starts edging away. The professionals wait for evidence, but what they see next raises a line of eyebrows.

INSERT [Target] (Id)
SELECT DISTINCT Id
FROM [Source] s
WHERE SomeCondition = 1
AND NOT EXISTS (
	SELECT 1
	FROM [Target] t
	WHERE t.Id = s.Id
	)

“A duplicate despite a DISTINCT bodyguard…” one mutters. “This isn’t the normal level of foul play.”

Mods arrive to cordon off the scene. Twelve different butlers have already been accused, but each has a solid alibi, even MAXDOP.

Generic Protagonist paces back and forth. They’re waiting for the forensics report. Finally it arrives.

“Intriguing.” Protagonist mutters under their breath. Hapless Policeman #5 is forced to listen to an analysis. “There are two possible culprits here: a concurrent session inserting the duplicate value after this one has performed the NOT EXISTS check, or the query is pulling duplicate values somehow from the Source table.”

Generic leans closer and puts on their glasses. “And the DISTINCT bodyguard is missing! No operator here performs that duty, even though the query requested a protection detail.” A quick check with the witness confirms that the column has a unique constraint on the Source table. “Looks like Concurrent Inserts just became our number one suspect.”

A lone, sputtering lightbulb hangs over a table. “Did you do it?” Generic Protagonist asks from opposite Concurrent Insert Theory. “Did you insert a row into the target table while another query was trying to insert the same row?”

Being a barely-anthropomorphized query pattern, it says nothing. Protagonist scoffs, “I knew you’d say that. That’s what they all say. But what about…EVIDENCE!”

Generic Protagonist is unable to contain themselves and starts pacing the small room. “We both know that Serializable Isolation Level is required to prevent new rows entering the queried range before a transaction is finished. The range locks on the target table will last until the query is finished, preventing problematic inserts and stopping your duplicate PK crime spree.”

Protagonist waves a sheet of paper, “And guess what I just had the lab check? That’s right, Serializable stops the duplicate PK error.” Concurrent Insert Theory says nothing. Generic leans back smugly, “I thought so. Guards! Take it away.”

Generic is about to file away the notes when a witness walks in. “More evidence for you,” he says, “Repeatable Read also prevents the duplicate insert.” Protagonist freezes, processing. “But Repeatable Read only holds locks on existing rows until the end of a transaction – it doesn’t prevent new inserts…” A pause follows. “I may have made a terrible mistake.”

It takes a late night call and some favors, but a lab tech is found who can run another test. What about a test version of the query on a new target that is guaranteed to run by itself? It fails. It fails! They thank the lab tech, who responds, “Of course! Anything when a barely-anthropomorphized query pattern is at stake!” It is not Concurrent Inserts. It must be Duplicates from Source. But how?

Generic Protagonist turns to the witness. What isolation level was the query running under? “Read Committed” is the answer. “Ah,” Generic replies, “are you aware of the criminological paper, Read Committed is a Garbage Isolation Level? I think I know how the murder happened.”

Various suspects, witnesses, and even a butler or two are gathered into a room. “I’m sure you’re wondering why I called you here.” Generic proclaims. “Actually no,” says one of the butlers, “every generic detective has a scene where they reveal the culprit to a room of people. Just get to the point.” Protagonist sighs, and gestures to two attendees. “Fine. They did it. It was Duplicates from Source along with its accomplice, Migrating Data.”

They continue, “Read Committed is what allowed them to enact the crime. You see, there were concurrent updates in the Source table that would move rows around, causing some to be read twice. And no operator enforced distinct results from the query, because there already was a constraint and the engine didn’t think it was necessary.”

Duplicates from Source stares blankly as officers take it away, destined to spend the rest of its days under Repeatable Read.

Satisfied with a case well-solved, Generic Protagonist adds a final entry to the case file, titled “Evidence: Repro”

————————————————————-

Create a copy of the Users table from StackOverflow2013 in your test database, a target table, and an index.

USE TestDB

CREATE TABLE dbo.UsersCopy
(
[Id] [int] NOT NULL PRIMARY KEY CLUSTERED,
[AboutMe] [nvarchar] (max) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Age] [int] NULL,
[CreationDate] [datetime] NOT NULL,
[DisplayName] [nvarchar] (40) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[DownVotes] [int] NOT NULL,
[EmailHash] [nvarchar] (40) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[LastAccessDate] [datetime] NOT NULL,
[Location] [nvarchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Reputation] [int] NOT NULL,
[UpVotes] [int] NOT NULL,
[Views] [int] NOT NULL,
[WebsiteUrl] [nvarchar] (200) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[AccountId] [int] NULL
)
GO

CREATE INDEX IX_UsersCopy_SomeColumns
ON dbo.UsersCopy (Age, DownVotes)
INCLUDE (CreationDate)
GO

CREATE TABLE dbo.UsersTarget (
[Id] [int] NOT NULL PRIMARY KEY CLUSTERED
)
GO

INSERT dbo.UsersCopy
SELECT *
FROM StackOverflow2013.dbo.Users
GO

--is rcsi off?
USE [master]
GO
ALTER DATABASE [TestDB] SET READ_COMMITTED_SNAPSHOT OFF WITH NO_WAIT
GO

Scanning the index takes ~80ms. Let’s use WAITFOR TIME statements to run an update 10ms after the insert started reading from the source table.

--Session 1: the insert, scanning the index on the source
WAITFOR TIME '06:16:25.462'

INSERT dbo.UsersTarget
SELECT DISTINCT ID
FROM dbo.UsersCopy u
WHERE CreationDate = '2008-07-31 21:57:06.240'
AND NOT EXISTS (
	SELECT 1
	FROM dbo.UsersTarget ut
	WHERE ut.Id = u.Id
	)

--Session 2: run in a separate window; the update moves a row within the scanned index
WAITFOR TIME '06:16:25.472'

UPDATE dbo.UsersCopy
SET DownVotes = 1000000
WHERE ID = 10

And a helper script for the WAITFORs

DECLARE @dt TIME(3) = DATEADD(SECOND,40,SYSDATETIME())

PRINT 'WAITFOR TIME '''+CONVERT(VARCHAR(40),@dt)+''''
PRINT 'WAITFOR TIME '''+CONVERT(VARCHAR(40),DATEADD(MILLISECOND,10,@dt))+''''

Cleanup/reset in case you missed something or want to test multiple times.

UPDATE dbo.UsersCopy
SET DownVotes = 0
WHERE ID = 10

truncate table dbo.UsersTarget

The prosecution rests.

Left and Right Deep Hash Joins

There’s a lot already written about left versus right deep hash joins. Unfortunately for us SQL Server nerds, “left” and “right” don’t make as much sense in SSMS query plans – those have a different orientation than the trees of database theory.

But if you rotate plans, you can see left and right that make sense (even if they still don’t match canonical shapes). Just follow the join operators.

The great thing about left deep plans is that only a couple of hash tables need to exist at a time, saving memory.

Whereas a right-deep hash join plan needs to keep all the hash tables around until it’s done.

So why would SQL Server ever choose a right-deep plan? There are two key ideas here. First, each shape has just one privileged position: a single table that is only ever probed, never fills up a hash table, and never adds to the memory overhead. Review the two plans above – can you spot which table in each never contributes to a build?

Second, SQL Server has to pick just two tables to start with (joining more than two at once would be an n-ary join, which mostly exists in theory). Certain join patterns let you choose any table to start with, while others severely limit the choices. Star joins, a common data warehouse pattern, require the central table to be part of the initial join.

I had a vague intuition of this part, but drawing it out really helped me – here are three basic patterns I tend to see:

First, an N-ary join, where all tables join on an equivalent value.

SELECT *
FROM Users A
JOIN BannedUsers B
   ON A.UserID = B.UserID
JOIN Donors C
   ON A.UserID = C.UserID
JOIN Achievements D
   ON C.UserID = D.UserID

Then there are what I call chained joins (I assume there’s an official name, but I don’t know it).

SELECT *
FROM Users A
JOIN Vacations B
   ON A.UserID = B.UserID
JOIN Itineraries C
   ON B.VacationID = C.VacationID
JOIN Destinations D
   ON C.DestinationID = D.DestinationID

Finally, and most importantly for this topic, there are star joins, where everything relates back to a central table.

SELECT *
FROM UserFacts A
JOIN NameDim B
   ON A.NameID = B.NameID
JOIN AddressDim C
   ON A.AddressID = C.AddressID
JOIN DemographicsDim D
   ON A.DemographicsID = D.DemographicsID

In a star join, that central table is usually massive. Billions of rows into a hash table is typically a Bad Idea. And since all joins involve that table, SQL Server can’t delay it until the privileged position of a left deep plan. Instead, there will be a right deep hash join plan, and the massive table will be placed in the privileged position that never fills up a hash table.

So right-deep hash join plans exist because sticking gajillions of rows into a hash build tends to go poorly, and the optimizer picks a right deep plan to avoid blowing out your tempdb. Hurray for the optimizer!…this time.

The Top/Max Transformation

Behold, a simple MAX query:

But wait, where did that Top operator come from? Oh, SQL Server, you clever rascal…

Simple and extremely effective – a solid query transformation. But can you handle more complex versions, SQL Server?

Let’s try the bane of ordered scans: partitioned tables.

Please note that all repro scripts are shared at the bottom

Partitioning

The transformation works if the column is both the leading index key and the partitioning key.

But fails if the leading key is not the partitioning key.

I guess that makes sense to me, because it would have to grab the top value from each partition then sort those, a much more complicated transformation.
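
Just to illustrate the idea (not a guaranteed fix – whether the optimizer does anything clever per group is up to it), here’s the grab-per-partition-then-take-the-max approach written by hand, assuming the table has been recreated partitioned on Number2 as the repro script at the bottom describes:

SELECT MAX(partition_max) AS overall_max
FROM (
	SELECT MAX(Number1) AS partition_max
	FROM PartitionedNumbers
	GROUP BY $PARTITION.pf_hundreds(Number2)
	) p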

Second Column

Can SQL Server use the Top/Max transformation when looking at the second column of an index? In theory it should work.

Time for our favorite DBA sing-along song, It Depends.

Here’s the table, clustered on c1, c2, loaded with some filler data as seen in the repro script at the end.

CREATE TABLE dbo.A(
c1 INT NOT NULL, 
c2 INT NOT NULL,
INDEX CX_somecreativename CLUSTERED (c1,c2)
)

The simplest form works

SELECT MAX(c2)
FROM dbo.A
WHERE c1 = 3

What if we make it more complicated? Let’s add a basic temp table.

CREATE TABLE #3 (three INT NOT NULL)
INSERT #3 VALUES (3)

This works…

SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) 3 FROM #3)

But this doesn’t

SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) three FROM #3)

Spot the difference? Pulling a column value from the second table breaks the transformation, but pulling a constant keeps it. Weird. Anyways, let’s try a rewrite (though inexact) of the query.

SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
	SELECT MAX(c2) AS the_max
	FROM dbo.A
	WHERE A.c1 = s.c3
	) x

Victory again! Wait, what’s that? A change to the table definition making the second column nullable?

ALTER TABLE dbo.A
ALTER COLUMN c2 INT NULL

Ok fine, I’m sure the last query won’t…oh.

So the TOP/MAX transformation is finicky on a second column, depending on nullability [is that a word?] and how the query is written.

Going Backwards

There’s even a situation where SQL Server should (should!) be able to do a reverse transformation, a TOP into a MAX. Columnstore indexes store maximum values per segment, so a TOP(1) query could simply grab that!

Conclusion

I can’t tell you how tempted I was to call this MAXTOP. Thinking about it still causes stray giggles to bubble up, as I gleefully envision sharing this info with eager young DBAs at a SQL Saturday, then ushering them off to a session on parallelism. Thankfully I’m not that evil.

The real conclusion is that SQL Server is programmed to be very clever. Despite the cleverness though, details matter, and the cleverness often falls short. What’s that warm glow I feel inside? Ah yes, job security.

Repro scripts

USE TestDB
GO

CREATE PARTITION FUNCTION pf_hundreds(int)
    AS RANGE RIGHT FOR VALUES (100,200,300,400,500,600,700,800,900);

CREATE PARTITION SCHEME ps_hundreds
AS PARTITION pf_hundreds
ALL TO ( [PRIMARY] );
GO

DROP TABLE IF EXISTS PartitionedNumbers
GO

CREATE TABLE PartitionedNumbers (
Number1 INT NOT null, 
Number2 INT NOT NULL, 
CONSTRAINT PK_numnum PRIMARY KEY CLUSTERED(Number1,Number2))
ON ps_hundreds(Number1) --drop/create table partitioned on Number2 for second query

INSERT PartitionedNumbers
SELECT TOP(999) ROW_NUMBER() OVER(ORDER BY 1/0), ROW_NUMBER() OVER(ORDER BY 1/0)
FROM master.dbo.spt_values

SELECT MAX(Number1)
FROM PartitionedNumbers
GO

USE TestDB
GO

DROP TABLE IF EXISTS dbo.A
DROP TABLE IF EXISTS #3
go

CREATE TABLE dbo.A(
c1 INT NOT NULL, 
c2 INT NOT NULL,
INDEX CX_somecreativename CLUSTERED (c1,c2)
)

INSERT dbo.A
SELECT TOP(100000) 
(ROW_NUMBER() OVER(ORDER BY 1/0))%100,
(ROW_NUMBER() OVER(ORDER BY 1/0))%500 + 250
FROM sys.columns a, sys.columns b, sys.columns c

CREATE TABLE #3 (three INT NOT NULL)
INSERT #3 VALUES (3)

--works
SELECT MAX(c2)
FROM dbo.A
WHERE c1 = 3

--works
SELECT MAX(c2)
FROM dbo.A 
WHERE c1 = (SELECT 3)

--works
SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) 3 FROM #3)

--does not work
SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) three FROM #3)

--works
SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
	SELECT MAX(c2) AS the_max
	FROM dbo.A
	WHERE A.c1 = s.c3
	) x


--change things around
ALTER TABLE dbo.A
ALTER COLUMN c2 INT NULL

--does not work
SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
	SELECT MAX(c2) AS the_max
	FROM dbo.A
	WHERE A.c1 = s.c3
	) x

No Running at the Resource Pool

Ok, I get it, scheduling queries can be complicated. See this and this and maybe this and this too but only if you have insomnia. I still thought I kinda understood it. Then I started seeing hundreds of query timeouts on a quiet server, where the queries were waiting on…what? CPU? The server’s at 20%, why would a query time out after a minute, while only getting 5s of CPU, and not being blocked?

It was Resource Governor messing up one of my servers recently in a way I never expected. Apparently RG will try to balance *total* CPU usage for pools, to the point that a single query sharing a scheduler will be completely starved.

Uh, this might be easier to explain with a picture:

And a demo of course:

Let’s set up some Resource Pools with no constraints. (Thanks to this blog for making some easy-to-copy scripts). Again, these pools don’t limit anything – the only thing they do is exist. Behold, pools Apple and Banana:

USE master
GO
 
CREATE RESOURCE POOL [Apples] WITH(
		min_cpu_percent=0, 
		max_cpu_percent=100, 
		min_memory_percent=0, 
		max_memory_percent=100, 
		AFFINITY SCHEDULER = AUTO
)
GO
 
CREATE RESOURCE POOL [Bananas] WITH(
		min_cpu_percent=0, 
		max_cpu_percent=100, 
		min_memory_percent=0, 
		max_memory_percent=100, 
		AFFINITY SCHEDULER = AUTO
)
GO

CREATE WORKLOAD GROUP [Apple] 
USING [Apples]
GO
CREATE WORKLOAD GROUP [Banana] 
USING [Bananas]
GO

CREATE FUNCTION dbo.Fruit_Inspector() 
RETURNS SYSNAME WITH SCHEMABINDING
AS
BEGIN
  DECLARE @workload_group sysname;
 
  IF (program_name() LIKE 'A%')
      SET @workload_group = 'Apple';
  IF (program_name() LIKE 'B%')
      SET @workload_group = 'Banana';
     
  RETURN @workload_group;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.Fruit_Inspector);
GO

ALTER RESOURCE GOVERNOR RECONFIGURE
GO

Now for a little bit of SSMS trickery: you can use connection properties to set an application name of your choice, assigning your connection to a resource pool.

Check which scheduler you’re on (and your session id, workload group, and app name) with this query:

SELECT er.scheduler_id, wg.name AS wg_name, er.session_id, es.program_name
FROM sys.dm_exec_requests er
JOIN sys.resource_governor_workload_groups wg
ON wg.group_id = er.group_id
JOIN sys.dm_exec_sessions es
ON er.session_id = es.session_id
WHERE er.session_id = @@SPID

Get a session ready for Resource Pool Apple – it doesn’t matter what scheduler it’s on. Place an infinite-loop CPU-sucking query in it, ready to go. I don’t start it until later.

DECLARE @trash bigint

WHILE 1=1 
BEGIN
	SELECT @trash = COUNT(*) FROM dbo.Votes
end

Now get sessions for Pools Apple and Banana that happen to land on the same scheduler (I did this by opening a bunch of Apple sessions, then reconnecting the Banana session until I got a match).

Start the infinite-loop query in the first Apple session, then run a high-CPU query in each of the colliding sessions – here’s what I used:

SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
OPTION(MAXDOP 1,USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140'))

Remember we have two queries running from Pool Apple, and one from Pool Banana, where the Banana query is on the same scheduler as an Apple query. Here’s the Banana runtime – 15s of pure CPU:

But here’s the Apple runtime:

Apple took 30s because it wasn’t allowed *any* CPU time while Banana was running. Because SQLOS was trying to balance total CPU usage across the resource pools while the infinite query ran, the Apple query sat in SOS_SCHEDULER_YIELD for the entire duration of the Banana query, which I do think is absolutely bananas.

Remember, the Banana CPU time was 15,180 ms

You’ll even see super-long scheduler yield waits with this pattern.

If the totals look off, that’s because I tested multiple times in these sessions

Yes, seriously, waiting 5 seconds to get time on the CPU.
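
If you’re following along at home, session-level wait stats make the starvation easy to see (sys.dm_exec_session_wait_stats requires SQL Server 2016 or later):

SELECT session_id, wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE wait_type = 'SOS_SCHEDULER_YIELD'
AND session_id = @@SPID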

Please note this doesn’t even need to be a heavy query from Pool Apple:

“Working as intended” – Top Support Man

In summary, have the doodle again:

In other summary, Resource Governor is scary, I blame it for hundreds of avoidable timeouts, and I hope MS improves their documentation.

P.S. If you followed my demo, that infinite loop is still running – might wanna kill it. Even worse, RG is still around, which you can fix with the below:

ALTER RESOURCE GOVERNOR DISABLE
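
Disabling stops the classification, but the pools, groups, and classifier function all stick around. Here’s a fuller cleanup sketch – run it once the demo sessions have disconnected, since workload groups can’t be dropped while sessions are still assigned to them:

--detach the classifier, then drop groups before pools
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = NULL)
ALTER RESOURCE GOVERNOR RECONFIGURE
GO
DROP WORKLOAD GROUP [Apple]
DROP WORKLOAD GROUP [Banana]
DROP RESOURCE POOL [Apples]
DROP RESOURCE POOL [Bananas]
ALTER RESOURCE GOVERNOR RECONFIGURE
GO
DROP FUNCTION dbo.Fruit_Inspector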

Zombie Transactions Ate my Database

A failover that needs five minutes for the secondary database to come online is a problem, especially if it generates stack dumps. Thankfully, Microsoft has Top Men, and with their help digging into source code, we uncovered…well, something neither of us had ever seen before. I’m naming them Zombie Transactions, because then I can pretend my job is more glamorous than clicking at a screen all day.

Why the Long Failover?

Every time there’s a failover, SQL Server must run the recovery process. The standard Analysis, Redo, and Undo phases occur, and sometimes even an extra step. In our case, it was Redo taking a ridiculously long time. According to Extended Events, SQL Server was searching through the log for 25,000 active transactions on every recovery (and incidentally producing stack dumps because this took so long).

Except these transactions weren’t really live, they were dead, shambling about our system, eating our performance. They didn’t come from normal activity, they didn’t show up in dmvs, and they couldn’t be killed with a failover.

I eventually found where they were lurking by using fn_dblog – they were in our checkpoints. Every checkpoint needs to include active transactions – it’s part of the design that allows a checkpoint to occur without freezing your whole database. And for some reason, our database was including this horde of undead transactions in each checkpoint.

Unresolved Non-DTC Cross Database…What?

This is where Microsoft support was helpful (you know, Top support, not the level 1s you have to put up with for the first month of an issue). The checkpoint entries were identified as “unresolved non-DTC cross-database transactions.” This awkward tongue-twister is the other reason I’m calling them zombies.

In other words, these are leftovers from transactions spanning multiple databases with an AG (for example, a trigger that logs changes to a separate non-AG database). If the transaction doesn’t complete correctly, it can leave an unresolved zombie behind. A minimal repro succeeded at generating a couple with a mid-transaction failover, though we never figured out how exactly 25k came about.

Zombie Hunting

I got tired of checking manually, so I wrote a script to search for zombies. It uses fn_dblog, which is undocumented and comes with caveats. I recommend running on a secondary.

USE [MyDB]
GO
CHECKPOINT
go
DECLARE @base BIGINT
DECLARE @desc TABLE (descr VARCHAR(500))

--grab a transaction ID from the current log as a baseline; zombie XdesIDs are far older (lower)
SELECT TOP(1) @base = CONVERT(BIGINT,CONVERT(VARBINARY(6),'0x'+REPLACE([Transaction ID],':',''),1))
FROM sys.fn_dblog(NULL,NULL)
WHERE [Transaction ID] <> '0000:00000000'

--pull the active-transaction lists that each checkpoint records
INSERT @desc
SELECT Description
FROM sys.fn_dblog(NULL,NULL)
WHERE Operation = 'LOP_XACT_CKPT'

--split out the XdesIDs and count how many predate the baseline
SELECT COUNT(CASE WHEN xacts.xval < @base THEN 1 END) AS zombies, COUNT(*) AS total
FROM (
    SELECT CONVERT(BIGINT,CONVERT(VARBINARY(6),'0x'+x.value,1)) xval
    FROM (
        SELECT REPLACE(REPLACE(descr,'XdesIDs: ',''),':','') xdes
        FROM @desc
    ) l
    CROSS APPLY STRING_SPLIT(l.xdes,',') x
) xacts
WHERE xval <> 0

How to Fix [Everything with Computers]

There was some back and forth with Microsoft on how to remove these – I didn’t want to edit memory, and restarting the server didn’t help. However, I quickly discovered that setting the database offline then online fixed it. That’s right, turning it off then on again was a solution.

After some more experimenting, we figured out that setting the database to Read-Only then back would fix it, and was faster than offline/online. A teammate also discovered that reseeding is necessary, since the zombies would reappear if you fail over. He built an awesome script for doing this, and I’m sure he would be willing to help for entirely reasonable rates™. Our failover time went from five minutes to ten seconds after we cleared the zombies.
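
For reference, the bounce itself is short – a minimal sketch using the [MyDB] name from the script above (in an AG you’ll likely need to remove the database from the group first, then reseed as described, and yes, this is downtime):

ALTER DATABASE [MyDB] SET READ_ONLY WITH ROLLBACK IMMEDIATE
ALTER DATABASE [MyDB] SET READ_WRITE WITH ROLLBACK IMMEDIATE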

Summary

You might have zombie transactions if you’ve been running an AG without the DTC setting. If you do, and there are a lot of them, they can slow down failovers. To kill them off, you’ll have to set the database to read-only and back, or take it offline and online again. Yes, this requires downtime.

Please let me know if you find them in your system – seriously, let me know, and let Microsoft know. That way there’s a chance of us getting a way to fix without needing downtime.

Fixing Queues with Watermarks

Queues are pretty common, but it’s not too often I encounter them in SQL Server. There’s a reason: SQL Server is bad at them. I’m not just talking about Service Borker – even native table implementations crumble under heavy load.

The standard pattern looks something like below – allowing multiple threads to pull a work item from the single queue table.

DECLARE @d TABLE (filler VARCHAR(500)); --table variable to store the stuff to process

WITH CTE AS 
    (SELECT TOP (1) *
    FROM dbo.testq WITH (ROWLOCK, READPAST) --locking hints allow concurrency
    ORDER BY ID)
DELETE CTE
OUTPUT Deleted.filler
INTO @d

I recently had the pleasure of investigating a procedure that pulled from a queue. Normally it was fast, but occasionally runtime would spike. The spooky thing was the query was using an ordered index scan that should only read one row, but during the spikes it was reading thousands.

Surely there’s a rational explanation…

…what to blame when your queue table has a bad hair day

Seriously. In the standard implementation of a queue table, dequeued records are deleted. But SQL Server typically doesn’t remove deleted rows immediately. Instead it marks them as ghost records (aka soft deletes), and exorcises them later, which is why a TOP(1) can easily read a lot more than one row.

So when the dequeue finds the next record to work on, it scans from the start of the index through ghosted rows to reach it. This isn’t a problem if there are only a couple, but sometimes servers can be so busy that Ghost Record Cleanup can’t keep up, or the ghost records are needed for versioning (which can easily happen if you use AGs). In those cases, the number of ghosts grows and grows, and you end up haunted by bad performance.
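
If you want to see the haunting directly, the physical stats DMV exposes ghost counts (table name taken from the repro below; the ghost columns are only populated in SAMPLED or DETAILED mode):

SELECT index_id, ghost_record_count, version_ghost_record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.testq'), NULL, NULL, 'DETAILED')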

It was an intriguing problem, and I worked like a man possessed. I tested some bizarre approaches to shift rows around or force cleanup without blocking parallel dequeues, but what ended up succeeding was a watermark pattern. In a separate table, I store the minimum identity processed. Then, every dequeue can range scan starting from the watermark.

In the original implementation, doubling the number of rows to dequeue would quadruple the runtime. 30k rows took 2:50, while 60k rows took 10:01. For the watermark pattern however, 30k rows dequeued in :23, and 60k in :48. A clear win!

Here’s roughly what my implementation looked like. If you want to test for yourself, you can force ghost records (specifically version ghosts) by leaving a select transaction open with snapshot isolation enabled. I then spammed dequeues in while loops with the help of SQLQueryStresser.
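
For the version-ghost part, here’s a sketch of what I mean by leaving a transaction open (it assumes ALLOW_SNAPSHOT_ISOLATION is ON for the database) – run it in a separate session and leave it sitting there while you spam dequeues:

SET TRANSACTION ISOLATION LEVEL SNAPSHOT
BEGIN TRAN
SELECT COUNT(*) FROM dbo.testq
--leave the transaction open; COMMIT when you're done testing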

First, the tables:

DROP TABLE IF EXISTS dbo.testq;
DROP TABLE IF EXISTS dbo.qtrack;

CREATE TABLE dbo.testq
(
    ID INT IDENTITY PRIMARY KEY,
    inserttime DATETIME2(2) CONSTRAINT DF_qq2 DEFAULT GETUTCDATE(),
    filler CHAR(400)
)

CREATE TABLE dbo.qtrack --watermark table
(
    ID INT PRIMARY KEY
)

INSERT dbo.qtrack
VALUES (0)

INSERT dbo.testq (filler)
SELECT TOP (30000) 
    REPLICATE(CONVERT(VARCHAR(36), NEWID()), 11) --filler
FROM sys.all_columns a,
     sys.all_columns b;
GO

Here’s the normal style of dequeue:

CREATE OR ALTER PROC dbo.dequeue_regular
AS
BEGIN
    DECLARE @d TABLE
    (
        junk VARCHAR(500)
    );
    WITH CTE
    AS (SELECT TOP (1)
               *
        FROM dbo.testq WITH (ROWLOCK, READPAST)
        ORDER BY ID)
    DELETE CTE
    OUTPUT Deleted.filler
    INTO @d;
END;

And the dequeue using the watermark table:

CREATE OR ALTER PROC dbo.dequeue_watermark
AS
BEGIN

    DECLARE @d TABLE
    (
        ID INT,
        junk VARCHAR(500)
    );

    DECLARE @w INT = (SELECT TOP 1 ID FROM dbo.qtrack)

    WITH CTE
    AS (SELECT TOP (1)
               *
        FROM dbo.testq WITH (ROWLOCK, READPAST)
        WHERE ID >= @w
        ORDER BY ID)
    DELETE CTE
    OUTPUT Deleted.ID,
           Deleted.filler
    INTO @d;

    DECLARE @t INT = (SELECT TOP 1 ID FROM @d)

    IF @t % 100 = 0
    BEGIN
        UPDATE dbo.qtrack
        SET ID = @t - 50;
    END;
END;

There are several things to note with the way I designed this. Updating the watermark in every concurrent dequeue would be a Bad Idea, so I added the % 100 check such that roughly every hundredth dequeue moves the watermark. Also, I didn’t simply set the watermark to the processed value, but rather some buffer amount lower. To be super safe, we also set up a job that runs every five minutes to rebuild the table, correct the watermark, and log any issues.
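
That correction step is simple enough – here’s a minimal sketch of it using the tables from this post (our real job also rebuilt the index and logged what it found):

DECLARE @oldest INT = (SELECT MIN(ID) FROM dbo.testq)

IF @oldest IS NOT NULL AND EXISTS (SELECT 1 FROM dbo.qtrack WHERE ID > @oldest)
    UPDATE dbo.qtrack
    SET ID = @oldest --never let the watermark get ahead of unprocessed rows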

And that’s it, a method to protect yourself from ghosts using SQL. It’s been working like a charm for us, and I hope it works for those of you who need it. Oh, and please let me know if you encounter data Bigfoot – I’d like to build a proc for him too.

A Doodled Intro to the Storage Engine

Paul Randal is a SQL Server legend with loads of informative articles. But when I was a baby DBA first reading Inside the Storage Engine, I got a little stuck. It took many passes before, eventually, finally, it clicked. I wish I had had a lightweight introduction, so in the practice of paying it forward…

Here’s the starting point: sometimes it’s easier to manage lots of small things (say, the 1s and 0s of data) by grouping them into larger things. It’s the same reason you don’t buy rice by the grain.

For SQL Server, data is logically grouped into 8KB Pages, which are themselves grouped into Extents (an Extent being 8 contiguous pages).

Within a file, you need to know where the data is, and that’s why there are mapping pages!

SQL Server uses a combination of maps to describe page usage. The GAM page shows which extents are available (thus it’s 0 for allocated, 1 for open/unused). The SGAM exists because pages in an extent might be used for multiple tables (a mixed extent), or reserved for a single table. The SGAM marks which mixed extents can have a page allocated.

Most mapping pages describe the space within the file, and are repeated at fixed locations for the area they cover. For example, the GAM page covers about 4GB, so it’s repeated every 4GB.
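
If you want to peek at one yourself, DBCC PAGE can dump it (undocumented, so the usual caveats apply; page 2 of a data file is its first GAM – swap in your own database name):

DBCC TRACEON(3604)
DBCC PAGE('TestDB', 1, 2, 3) --database, file 1, page 2, full print option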

Mapping pages for tables, called IAMs, need different treatment however. Think about it – Microsoft doesn’t know ahead of time what tables you’ll have, and can’t have IAMs in fixed locations. So the IAMs contain info about what they cover and where the next IAM is, which allows them to be placed anywhere in the file.

And there’s still more to learn, like those pesky PFS pages! I’m not going to write about them though; Paul already did. So go! Continue your learning journey and profit from your hard-earned knowledge!

Finding a Spinlock Owner in a Dump

Oh. Lovely…

Well, this SQL Server has dumps. At least it stays regular. These happen to all be of the “Non-Yielding Scheduler” variety. Now for the fun task of dump diving.

The Journey Begins

Windbg makes starting relatively easy…load the dump:

Click the blue text:

And wait a long time for windbg to load symbol files. (You don’t actually have to wait if you’re just reading a blog. Kinda nice, huh?) When it finally finishes, behold, the offending stack of a dump:

Now, I may not be an expert, but sqlmin!Spinlock sounds like…a spinlock. This thread has been spinning for over a minute, never returning to a waiting state, because something else is holding the spinlock resource.

Thankfully, helpful friends alerted me to a blog that revealed the value of an acquired spinlock “is the Windows thread ID of the owner.” Meaning I might be able to find the cause.

A Different Approach

I already had a suspect thread – it’s the only other one active when I look at all of SQL Server’s threads.

Actually digging through hundreds of threads is annoying of course, and there’s a slightly faster way: !uniqstack, which removes duplicates. (Thanks to Bob Ward for sharing this trick in one of his presentations.)

If you run that, you’ll see a multitude of stacks with something like this at the top.

Those NtSignalAndWait calls are how SQL Server makes a worker wait. Those workers weren’t doing anything at the time the dump was taken.

However, there’s another thread that isn’t waiting:

I bet this one is holding the spinlock, but I have to prove it.

Obstacles

I need to find the memory address of the owner, and confirm which part of this info is the Windows thread. Let’s start by looking at the spinlocked thread.

Thankfully, the context is already 645, but if we needed to change it, we could use ~645s. Note that 645 is NOT the Windows thread id. Instead, we can look at the part that says Id: 1d60.3914. 1d60 is the process (sqlservr.exe) id, and 3914 is the thread id.

Let’s pull up the registers saved for the thread.

There are a load of them, but we want rip. The rip register saves the address of the instruction this thread was working on at the time, which should be in the spinlock code, and the spinlock code is what will tell us where the lock’s memory address lives.

Cool. Now we can copy that address into the disassembler.

Found it! The critical part of a spinlock has to be an atomic compare and swap – the lock cmpxchg. In this case it’s looking at the memory address in rdi and replacing the value with rsi (unless the lock is already held, which it is). Looking at our registers, rsi is indeed the Windows thread id, 3914. Now we just look at the memory address held in rdi.

d displays memory at an address, dq displays it as quad words, and L1 limits the output to a single one

And there we see the Windows thread id of the spinlock owner, 1280!

We can switch to that thread with ~~[1280]s…yup, it’s the same one I found before, the one that looks like it’s doing Redo operations. Victory!

Of course, when I bothered to actually read the documentation on cmpxchg, I learned that if a different value is in the target address (i.e., the spinlock is held by another thread), that value is placed in the eax register. Meaning I could have bypassed the whole exercise with r eax.

Concussion

Nope, not a typo. The appropriate ending to a windbg post is indeed head trauma. Hope you enjoyed it.