I’ve read it repeatedly: columnstore key lookups are extra slow. The question, of course, is why?
In my mental model, it makes sense. A normal key lookup adds about 3 reads, while a columnstore lookup should add at least a read per column, since each column lives in its own segments.
Logically the next step is to test it. I made a copy of the StackOverflow2013 Posts table and slapped clustered columnstore and a nonclustered index on it.
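Setup looked something like this – a sketch, with index names of my own choosing:
SELECT * INTO dbo.PostsCS FROM StackOverflow2013.dbo.Posts
CREATE CLUSTERED COLUMNSTORE INDEX CCI_PostsCS ON dbo.PostsCS
CREATE NONCLUSTERED INDEX IX_PostsCS_Id ON dbo.PostsCS (Id)
Then the test query: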
SELECT TOP (1) *
FROM dbo.PostsCS WITH(FORCESEEK)
WHERE ID > 0
There are 20 columns, so I should expect at least 20 extra reads on top of the initial seek, right?
That’s…a lot more than 20 extra reads. What’s going on?
Segment Lookups, not Row Lookups
First off, the key lookup for each column doesn’t just read the page the value lives on – it seems to read the entire segment. This can be tested by comparing lob reads to segment size. Here are some sample scripts:
SELECT c.name, s.on_disk_size/1024.0/8 AS approx_pages
FROM sys.column_store_segments s
JOIN sys.partitions p
ON p.partition_id = s.partition_id
JOIN sys.columns c
ON c.object_id = p.object_id
AND c.column_id = s.column_id
WHERE p.object_id = OBJECT_ID('dbo.PostsCS')
AND s.segment_id = 11 --I used segment min/max values to confirm this is the correct one
SET STATISTICS IO ON
SELECT TOP (1) LastEditDate --lob logical reads 49, segment size 48
FROM dbo.PostsCS WITH(FORCESEEK)
WHERE ID > 0
So it’s more like this:
Dictionaries
It gets worse.
The following query has 7135 lob reads
SELECT TOP (1) Body
FROM dbo.PostsCS WITH(FORCESEEK)
WHERE ID > 0
Which should make some sense, given that Body is nvarchar(max) – but the segment is only 43 pages. 43 pages is rather small for lots of nvarchar(max) data, isn’t it? That’s because SQL Server is actually storing the data in dictionaries.
So when the key lookup occurs for a column with a dictionary, not only does the whole segment get read, but the dictionary gets read too. And there might be more than one dictionary.
This is a lot of reads.
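If you’re curious how much dictionary data hangs off each column, something along these lines works – a sketch in the style of the segment query above (older versions may only expose hobt_id instead of partition_id in this view):
SELECT c.name, d.dictionary_id, d.type, d.entry_count, d.on_disk_size/1024.0/8 AS approx_pages
FROM sys.column_store_dictionaries d
JOIN sys.partitions p
ON p.partition_id = d.partition_id
JOIN sys.columns c
ON c.object_id = p.object_id
AND c.column_id = d.column_id
WHERE p.object_id = OBJECT_ID('dbo.PostsCS')
ORDER BY d.on_disk_size DESC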
And a Mystery
Something seems to happen with dictionary-backed columns where there are more reads than necessary, and a lot of read-ahead reads. I haven’t figured out why, and plan to tackle this next.
Recommendations
Columnstore key lookups can have the disastrous performance that people mistakenly attribute to normal key lookups. When you do need to use them, try to 1) pull as few columns as possible, 2) as few rows as possible, and 3) avoid dictionary-backed columns when possible.
I love how there are lots of little things in SQL Server – mostly ignored details that only sometimes matter but make sense when you stare at them. Actually, I’m not sure I love them. I forget about them, and then stub my toe on them. Occasionally though, there’s a quirky combination that leads to bloggable shenanigans.
Let’s start with Detail Number One, which has most definitely tripped me up: queries that are returning rows won’t time out. You might be familiar with connections using a default 30s timeout, but as long as the query returns a row every 29s, the client won’t time it out. You can even test this in SSMS.
WHILE 1=1
BEGIN
SELECT 'tro'+REPLICATE('lo',4000)
WAITFOR DELAY '00:00:29'
END
Now for a puzzle which will lead us to little detail number two: will the below query time out? The scalar function takes just under 3s per row, and there are 15 rows.
SELECT TOP 15 ID, dbo.Take3s(1)
FROM StackOverflow2010.dbo.Users
OPTION(USE HINT('DISABLE_TSQL_SCALAR_UDF_INLINING'))
Here’s the function and plan if you’re curious:
CREATE OR ALTER FUNCTION dbo.Take3s
(@useless BIT)
RETURNS bigint
AS
BEGIN
DECLARE @c bigint
SELECT @c = COUNT(*)
FROM StackOverflow2013.dbo.Comments
OPTION(MAXDOP 1)
RETURN @c
END
And the answer:
It times out, despite being a (mostly) streaming plan. And the reason comes down to Detail #2: rows aren’t delivered to the client one-at-a-time, but in packets.
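You can check your connection’s packet size if you’re curious – it defaults to 4096 bytes unless someone changed it:
SELECT net_packet_size
FROM sys.dm_exec_connections
WHERE session_id = @@SPID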
Even though the rows are produced every 3s, they aren’t delivered every 3s, and the client is left waiting up to the 30s mark where it cancels the query. But wait, we can fix this! All we have to do is fatten up the packet with junk.
SELECT TOP 15 ID, dbo.Take3s(1), REPLICATE('junk',2000)
FROM StackOverflow2010.dbo.Users
OPTION(USE HINT('DISABLE_TSQL_SCALAR_UDF_INLINING'))
And it works!
If you have a slow but streaming query that times out, you can return extra (junk) data in each row to fill up packets sooner and get those packets to the client before the deadline. So yes, you can fix timeouts by making queries worse.
Not that you should do that. This is a lolpost, not permission. I hope you enjoyed it.
Did you know that indexes with included columns have less efficient logging? Let’s update a single column in a single row.
Here’s what the log record for the clustered index looks like:
And here’s the log record for a nonclustered index. Notice a difference?
I’ll present the conclusion now: included columns in an index always get included in the log record, even if they weren’t modified. The same is not true for the clustered index. This can really matter if log generation is important to your environment, e.g. using an AG.
Proof
There are two ways to prove this: by examining log details and by comparing size after changes. We’ll start with the fn_dblog method.
USE TestDB
GO
ALTER DATABASE TestDB SET RECOVERY SIMPLE
GO
CREATE TABLE dbo.AllInclusive
(
ID INT IDENTITY,
Num INT NOT NULL,
JunkA BINARY(4) NOT NULL,
JunkB BINARY(4) NOT NULL,
JunkC BINARY(4) NOT NULL,
JunkD BINARY(4) NOT NULL,
JunkE BINARY(4) NOT NULL,
JunkF BINARY(4) NOT NULL,
INDEX PK_AllInclusive UNIQUE CLUSTERED (ID)
);
CREATE INDEX IX_LotsaJunk ON dbo.AllInclusive (ID)
INCLUDE(Num,JunkA,JunkB,JunkC,JunkD,JunkE,JunkF)
INSERT dbo.AllInclusive (Num,JunkA,JunkB,JunkC,JunkD,JunkE,JunkF)
VALUES (42,0xAAAAAAAA,0xBBBBBBBB,0xCCCCCCCC,0xDDDDDDDD,0xEEEEEEEE,0xFFFFFFFF)
--Tab 1
CHECKPOINT
GO
BEGIN TRAN
UPDATE dbo.AllInclusive
SET Num += 1
WHERE ID = 1
--rollback
--Tab 2
SELECT * --you can see the binary contents in RowLog Contents 0
FROM sys.fn_dblog(NULL,NULL)
And within the messy guts of undocumented output, you can see how uninvolved columns show up for the nonclustered index but not for the clustered.
But maybe divining details from an undocumented function isn’t enough for you. Empiricism, you declare, not haruspicy. Ok then – let’s make a million updates then measure log size, with and without included columns.
Run the setup again, but with larger columns to make the difference clear.
DROP TABLE IF EXISTS dbo.AllInclusive
CREATE TABLE dbo.AllInclusive
(
ID INT IDENTITY,
Num INT NOT NULL,
JunkA BINARY(400) NOT NULL,
JunkB BINARY(400) NOT NULL,
JunkC BINARY(400) NOT NULL,
JunkD BINARY(400) NOT NULL,
JunkE BINARY(400) NOT NULL,
JunkF BINARY(400) NOT NULL,
INDEX PK_AllInclusive UNIQUE CLUSTERED (ID)
);
CREATE INDEX IX_LotsaJunk ON dbo.AllInclusive (ID)
INCLUDE(Num,JunkA,JunkB,JunkC,JunkD,JunkE,JunkF)
INSERT dbo.AllInclusive (Num,JunkA,JunkB,JunkC,JunkD,JunkE,JunkF)
VALUES (42,0xAAAAAAAA,0xBBBBBBBB,0xCCCCCCCC,0xDDDDDDDD,0xEEEEEEEE,0xFFFFFFFF)
CHECKPOINT --step 2
GO
SET NOCOUNT ON
BEGIN TRAN
DECLARE @i INT = 0
WHILE @i < 1000000
BEGIN
UPDATE dbo.AllInclusive
SET Num += 1
WHERE ID = 1
SET @i += 1
END
--COMMIT --step 4, don't forget this
Make sure to check log size before and after, and only commit after measuring.
--step 1 and 3
SELECT *
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Log File(s) Used Size (KB)'
AND instance_name = 'TestDB'
Now change the index to skip the unchanged included columns, and run the test again.
DROP INDEX IX_LotsaJunk ON dbo.AllInclusive
CREATE INDEX IX_NoJunk ON dbo.AllInclusive (ID)
INCLUDE(Num)
I get 7.8GB of logs for the index full of included columns, and 0.75GB of logs for IX_NoJunk. Note that this includes reserved space, so it isn’t a precise measurement of impact, just a proof of difference.
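If you want a number without the reserved space, you can peek at the open transaction itself before committing – a quick check along these lines, run from another session while step 3 is still uncommitted:
SELECT database_transaction_log_bytes_used, database_transaction_log_bytes_reserved
FROM sys.dm_tran_database_transactions
WHERE database_id = DB_ID('TestDB')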
This Matters
Of all the potential bottlenecks for SQL Server, I would say that log generation is easily in the top 5. It matters because logging = disk writes, and it really matters with AGs. Did you know that Azure SQL DB will throttle your log throughput unless you pay (even more) money?
So here you go, a reason to choose key lookups over covering indexes, another reason to say “it depends” when designing a database, and yet another reason to yell at missing index suggestions.
“It’s a very short blocking phase” they said. “Metadata only” they said.
Meanwhile, here’s my online rebuild of a zero-row table.
Yes, it’s been running for 30s, blocking the whole time. For extra fun, it’s unkillable.
Findings first: SQL Server stores details about missing index suggestions. The DMV pertinent to screwing up index operations is sys.dm_db_missing_index_group_stats_query. Should this DMV accumulate a very large number of entries, any index create/rebuild on any table with a missing index hint will take extra time to process those entries. Yes, that’s any table anywhere – it doesn’t matter if it’s in a separate database with a single hinted query.
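A quick way to check whether your server is heading that way is to count the entries (the DMV only exists on SQL Server 2019 and later – and fair warning, on an affected server even this count can take a while):
SELECT COUNT(*) AS entries
FROM sys.dm_db_missing_index_group_stats_query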
Troubleshooting
The prod version of this issue manifested as 1-3 minutes of SCH-M blocking for index ops, yet only on a single server; I couldn’t replicate it anywhere else. Of course, bad blocking meant nasty timeouts, and that in turn meant I was reluctant to run live diagnostics. Better to just stop index rebuilds. One does not take down production in order to test weird SQL behavior.
An important clue came from the rebuild of a tiny table, and is visible in the whoisactive screenshot above. Why would such a small rebuild need so much CPU? That worker was definitely doing something – the question was what.
My lucky break came with a bad proc that needed emergency index help – in this case the fix was worth the blocking, and the table was non-critical anyways. I readied the index creation script, and prepared an extended event to capture callstacks on waits and spinlock backoffs for my spid.
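The session looked roughly like this – a sketch from memory, so adjust the session_id, and note that resolving callstacks to readable names needs trace flag 3656 plus public symbols:
CREATE EVENT SESSION CallstackHunt ON SERVER
ADD EVENT sqlos.wait_info
(ACTION (package0.callstack) WHERE sqlserver.session_id = 123), --use your spid
ADD EVENT sqlos.spinlock_backoff
(ACTION (package0.callstack) WHERE sqlserver.session_id = 123)
ADD TARGET package0.ring_buffer
GO
ALTER EVENT SESSION CallstackHunt ON SERVER STATE = START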
And…jackpot. The worker was hitting SOS_SCHEDULER_YIELD waits, and for 3 minutes every callstack was topped with:
Missing indexes huh? I started poking around the DMVs, and discovered that sys.dm_db_missing_index_group_stats_query took several minutes to query – it was even unkillable, just like the index rebuilds.
Repro
I finally had enough clues to cause the issue on purpose. I’m going to show you how to do this at home, because breaking SQL Server is fun…in a test environment.
First, let’s create a table. No rows necessary, just tell SQL Server that there are.
drop table if exists dbo.RebuildTest
create table dbo.RebuildTest (
ID int identity primary key,
Num int,
Filler char(100)
)
UPDATE STATISTICS dbo.RebuildTest WITH ROWCOUNT = 1000000, PAGECOUNT = 10000
Now we need to generate missing index hints….lots of them. After some experimenting, I combined two features known to play nicely with caching: temp tables and dynamic SQL. (Dear junior DBA reading this: the prior sentence is sarcasm. Those features are notorious for caching issues.)
SET NOCOUNT ON
DECLARE @sql NVARCHAR(4000)
DECLARE @stop DATETIME = DATEADD(MINUTE,60,GETDATE())
WHILE GETDATE() < @stop
begin
SET @sql =
'select *
into #t'+CONVERT(CHAR(36),CONVERT(BINARY(16),NEWID()),1)+'
from dbo.RebuildTest
where Num = (select -1)'
EXEC(@sql)
end
I ran that from ten sessions for a really long time – I think I accumulated about 15 core hours of spam. Rinse and repeat as needed to get the desired slowdown on your own system.
Consequences of the above pollution:
Let’s create an unrelated table, also zero rows.
drop table if exists dbo.OtherTable
create table dbo.OtherTable (
ID int identity primary key,
Num int,
Filler char(100)
)
UPDATE STATISTICS dbo.OtherTable WITH ROWCOUNT = 1000000, PAGECOUNT = 10000
If you rebuild now, it will be super fast.
Run a single query that generates a missing index hint however…
select *
from dbo.OtherTable
where Num = (select 42)
And your rebuild transforms into a lumbering unkillable monstrosity
ALTER INDEX ALL ON dbo.OtherTable REBUILD WITH(ONLINE=ON,MAXDOP=1)
--need to tell SQL fake stats again if you want to repeat the tests
UPDATE STATISTICS dbo.OtherTable WITH ROWCOUNT = 1000000, PAGECOUNT = 10000
Resolution
Ironically enough, this can be fixed by, ugh, restarting SQL Server, adding the appropriate index, or rebuilding the table behind the missing index spam. Due to the nature of the pollution, none of the SQL handles from the DMV resolved to text, but I was able to work up a way to see which objects were behind the growth.
SELECT TOP 10 mid.database_id, mid.object_id, COUNT(*) c
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig
ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats_query migs
ON migs.group_handle = mig.index_group_handle
GROUP BY mid.database_id,
mid.object_id
ORDER BY c desc
There’s another detail I haven’t figured out: for my production system, the DMV had 3 million entries and was adding minutes of extra time. For my laptop, 9 million entries cost about 45 seconds. I’m kinda curious where the discrepancy comes from, but not enough that I’m willing to spend hours melting my personal machine or digging through disassembly. Besides, disassembly problems belong to Paul White, the magical SQL hermit on a remote island who solves weird code issues for booze, and who I secretly aspire to be if I grow up.
It’s probably a rare issue to hit as well, needing a server that hasn’t been restarted in a long enough time and unique queries that both generate a hint and don’t get combined with similar SQL. I mean, how ridiculous would it be for people to dynamically fill global temp tables named with guids? Nobody does that, right? Crazytown.
But this is the way of things – do weird stuff in SQL, get weird issues. At least weird issues lead to entertaining blog posts, which I sincerely hope does not incentivize you to keep doing the weird stuff. Especially Service Borker. Seriously, knock that out.
The common corporate amoeba floats around, grasping at revenue, customers, and other nutrients with ciliated pseudopods. You, dear gainfully-employed DBA, are an organelle in this subsentient blob. Sometimes your employer-host catches another amoeba, and it becomes your duty to ingest their delicious data.
This was the situation I found myself in earlier this year – our company had absorbed another, and it was time to slurp up their tables. There were a lot of decisions to make and tradeoffs to weigh, and we ended up choosing to trickle-insert their data, but make it invisible to normal use until the moment of cutover.
The way we implemented this was with Row Level Security. Using an appropriate predicate, we could make sure ETL processes only saw migrated data, apps saw unmigrated data, and admins saw everything. To give a spoiler: it worked, but there were issues.
Let’s consider a simplistic RLS filter, with the StackOverflow2010 database for examples.
CREATE FUNCTION dbo.RLSfunctionSimple(@Migrated AS bit)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT 1 AS Pass
WHERE (@Migrated = 1 AND USER_NAME() IN ('ETLProcess','BubbaTheAdmin'))
OR (@Migrated = 0 AND USER_NAME() NOT IN ('ETLProcess'))
GO
Great! This works. We add an IsMigrated column to the Users table, set the security policy, and bam, we’re done. Yeah no.
ALTER TABLE dbo.Users
ADD IsMigrated BIT NOT NULL DEFAULT(0)
CREATE SECURITY POLICY RLS_Users
ADD FILTER PREDICATE dbo.RLSfunctionSimple(IsMigrated)
ON dbo.Users
WITH (STATE = ON);
More people than just Bubba want to see migrated data. A migration is a big deal, and everyone, everyone, is going to ask to see migrated data for reasons that are honestly quite legitimate. You will constantly be modifying who gets access.
There’s another issue: adding RLS to a table takes a SCH-M lock on that table. Have you experienced the joys of frequent SCH-M locks on critical tables in a highly active database?
So out of a desire to easily make access changes and avoid SCH-M locks (and because you can’t alter the function while it’s used by the security policy), you decide to make the RLS function reference tables. Let’s add one table for the ETL processes that should only see migrated rows, and one table for the folks who want to read everything. Everyone else by default only views unmigrated data.
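The lookup tables themselves are nothing fancy – a sketch:
CREATE TABLE dbo.SeeMigrated (UserName SYSNAME NOT NULL PRIMARY KEY)
CREATE TABLE dbo.SeeEverything (UserName SYSNAME NOT NULL PRIMARY KEY)
INSERT dbo.SeeMigrated VALUES ('ETLProcess')
INSERT dbo.SeeEverything VALUES ('BubbaTheAdmin')
And the function: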
CREATE FUNCTION dbo.RLSfunctionComplex(@Migrated AS bit)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT 1 AS Pass
WHERE (@Migrated = 1
AND EXISTS (SELECT 1 FROM dbo.SeeMigrated WHERE UserName = USER_NAME())
OR EXISTS (SELECT 1 FROM dbo.SeeEverything WHERE UserName = USER_NAME())
)
OR (@Migrated = 0
AND NOT EXISTS (SELECT 1 FROM dbo.SeeMigrated WHERE UserName = USER_NAME())
)
And now this
SELECT DisplayName
FROM dbo.Users
WHERE ID = 12345
Becomes this
RLS that references tables will break queries. Sometimes, the added complexity just made the optimizer fall over. Other times useful query transformations broke.
Note that the above query plan has nine extra nodes. Imagine a relatively complex query, where most of the involved tables have RLS. Let me tell you, things break. You might say, oh don’t worry, we’ll only add to core tables. No, that’s not how it will work. Bad data is expensive. The business, at least during a migration, cares about data quality. If you have RLS functioning fine on N tables, there will be pressure to make it N+1, just to be safe.
There’s a specific and common transformation that breaks when RLS reads other tables, the TOP-MAX transformation.
Here’s the normal version, at 0ms.
SELECT MAX(ID)
FROM dbo.Users
And here it is after RLS, scanning your tables and taking 116ms
Any place you rely on the TOP-MAX transformation, you will have to rewrite as an explicit TOP 1.
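The rewrite itself is simple enough – something like:
SELECT TOP (1) ID --explicit TOP 1 instead of MAX(ID), so the backward scan can stop early
FROM dbo.Users
ORDER BY ID DESC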
Recommendations
Try as hard as you reasonably can to avoid external references in your RLS function. We still ended up with one – we landed on a single look-up-everything table, and it didn’t break as much as other multi-table functions did.
Expect your coverage to grow beyond initial estimates, whether users who need access or tables that need protection.
You can hot-swap your RLS function with ALTER SECURITY POLICY. It still takes a SCH-M lock on the affected table, but it’s a lot better than dropping your security policy in order to alter the function.
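For example, swapping in the complex function from above looks something like this:
ALTER SECURITY POLICY RLS_Users
ALTER FILTER PREDICATE dbo.RLSfunctionComplex(IsMigrated)
ON dbo.Users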
All-in-all, I’m mostly ok with our decision. It meant a lot of work ahead of time (including adding the migration column to tables – don’t forget to include it in the indexes as well). But when the migration day came, we turned RLS off, and everything just worked. Given typical cutover craziness, that’s not a bad tradeoff to make.
A #sqlhelp member runs through the Slack channel, “Help! Help! My insert has been killed!” Bystanders gather around the still-warm body to stare at the soulless Consolas note fixed to its chest.
Rough way to go, that.
Clamor erupts immediately. Accusations fly. Anyone remotely resembling a butler starts edging away. The professionals wait for evidence, but what they see next raises a line of eyebrows.
INSERT [Target] (Id)
SELECT DISTINCT Id
FROM [Source] s
WHERE SomeCondition = 1
AND NOT EXISTS (
SELECT 1
FROM [Target] t
WHERE t.Id = s.Id
)
“A duplicate despite a DISTINCT bodyguard…” one mutters. “This isn’t the normal level of foul play.”
Mods arrive to cordon off the scene. Twelve different butlers have already been accused, but each has a solid alibi, even MAXDOP.
Generic Protagonist paces back and forth. They’re waiting for the forensics report. Finally it arrives.
“Intriguing.” Protagonist mutters under their breath. Hapless Policeman #5 is forced to listen to an analysis. “There are two possible culprits here: a concurrent session inserting the duplicate value after this one has performed the NOT EXISTS check, or the query is pulling duplicate values somehow from the Source table.”
Generic leans closer and puts on their glasses. “And the DISTINCT bodyguard is missing! No operator here performs that duty, even though the query requested a protection detail.” A quick check with the witness confirms that the column has a unique constraint on the Source table. “Looks like Concurrent Inserts just became our number one suspect.”
A lone, sputtering lightbulb hangs over a table. “Did you do it?” Generic Protagonist asks from opposite Concurrent Insert Theory. “Did you insert a row into the target table while another query was trying to insert the same row?”
Being a barely-anthropomorphized query pattern, it says nothing. Protagonist scoffs, “I knew you’d say that. That’s what they all say. But what about…EVIDENCE!”
Generic Protagonist is unable to contain themselves and starts pacing the small room. “We both know that Serializable Isolation Level is required to prevent new rows entering the queried range before a transaction is finished. The range locks on the target table will last until the query is finished, preventing problematic inserts and stopping your duplicate PK crime spree.”
Protagonist waves a sheet of paper, “And guess what I just had the lab check? That’s right, Serializable stops the duplicate PK error.” Concurrent Insert Theory says nothing. Generic leans back smugly, “I thought so. Guards! Take it away.”
Generic is about to file away the notes when a witness walks in. “More evidence for you,” he says, “Repeatable Read also prevents the duplicate insert.” Protagonist freezes, processing. “But Repeatable Read only holds locks on existing rows until the end of a transaction – it doesn’t prevent new inserts…” A pause follows. “I may have made a terrible mistake.”
It takes a late night call and some favors, but a lab tech is found who can run another test. What about a test version of the query on a new target that is guaranteed to run by itself? It fails. It fails! They thank the lab tech, who responds, “Of course! Anything when a barely-anthropomorphized query pattern is at stake!” It is not Concurrent Inserts. It must be Duplicates from Source. But how?
Generic Protagonist turns to the witness. What isolation level was the query running under? “Read Committed” is the answer. “Ah,” Generic replies, “are you aware of the criminological paper, Read Committed is a Garbage Isolation Level? I think I know how the murder happened.”
Various suspects, witnesses, and even a butler or two are gathered into a room. “I’m sure you’re wondering why I called you here.” Generic proclaims. “Actually no,” says one of the butlers, “every generic detective has a scene where they reveal the culprit to a room of people. Just get to the point.” Protagonist sighs, and gestures to two attendees. “Fine. They did it. It was Duplicates from Source along with its accomplice, Migrating Data.”
They continue, “Read Committed is what allowed them to enact the crime. You see, there were concurrent updates in the Source table that would move rows around, causing some to be read twice. And no operator enforced distinct results from the query, because there already was a constraint and the engine didn’t think it was necessary.”
Duplicates from Source stares blankly as officers take it away, destined to spend the rest of its days under Repeatable Read.
Satisfied with a case well-solved, Generic Protagonist adds a final entry to the case file, titled “Evidence: Repro”
————————————————————-
Create a copy of the Users table from StackOverflow2013 in your test database, a target table, and an index.
USE TestDB
CREATE TABLE dbo.UsersCopy
(
[Id] [int] NOT NULL PRIMARY KEY CLUSTERED,
[AboutMe] [nvarchar] (max) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Age] [int] NULL,
[CreationDate] [datetime] NOT NULL,
[DisplayName] [nvarchar] (40) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[DownVotes] [int] NOT NULL,
[EmailHash] [nvarchar] (40) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[LastAccessDate] [datetime] NOT NULL,
[Location] [nvarchar] (100) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Reputation] [int] NOT NULL,
[UpVotes] [int] NOT NULL,
[Views] [int] NOT NULL,
[WebsiteUrl] [nvarchar] (200) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[AccountId] [int] NULL
)
GO
CREATE INDEX IX_UsersCopy_SomeColumns
ON dbo.UsersCopy (Age, DownVotes)
INCLUDE (CreationDate)
GO
CREATE TABLE dbo.UsersTarget (
[Id] [int] NOT NULL PRIMARY KEY CLUSTERED
)
GO
INSERT dbo.UsersCopy
SELECT *
FROM StackOverflow2013.dbo.Users
GO
--is rcsi off?
USE [master]
GO
ALTER DATABASE [TestDB] SET READ_COMMITTED_SNAPSHOT OFF WITH NO_WAIT
GO
Scanning the index takes ~80ms. Let’s use WAITFOR TIME statements to run an update 10ms after the insert starts reading from the source table.
WAITFOR TIME '06:16:25.462'
INSERT dbo.UsersTarget
SELECT DISTINCT ID
FROM dbo.UsersCopy u
WHERE CreationDate = '2008-07-31 21:57:06.240'
AND NOT EXISTS (
SELECT 1
FROM dbo.UsersTarget ut
WHERE ut.Id = u.Id
)
WAITFOR TIME '06:16:25.472'
UPDATE dbo.UsersCopy
SET DownVotes = 1000000
WHERE ID = 10
And a helper script for the WAITFORs
DECLARE @dt TIME(3) = DATEADD(SECOND,40,SYSDATETIME())
PRINT 'WAITFOR TIME '''+CONVERT(VARCHAR(40),@dt)+''''
PRINT 'WAITFOR TIME '''+CONVERT(VARCHAR(40),DATEADD(MILLISECOND,10,@dt))+''''
Cleanup/reset in case you missed something or want to test multiple times.
UPDATE dbo.UsersCopy
SET DownVotes = 0
WHERE ID = 10
truncate table dbo.UsersTarget
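As for the sentence itself – the lab’s version of the insert under Repeatable Read looks something like this (a sketch; Serializable works too, per the earlier tests):
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
--shared locks on the source rows are now held for the duration, so a concurrent
--update can't move a row and cause it to be read twice
INSERT dbo.UsersTarget
SELECT DISTINCT ID
FROM dbo.UsersCopy u
WHERE CreationDate = '2008-07-31 21:57:06.240'
AND NOT EXISTS (
    SELECT 1
    FROM dbo.UsersTarget ut
    WHERE ut.Id = u.Id
)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED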
There’s a lot already written about left versus right deep hash joins. Unfortunately for us SQL Server nerds, “left” and “right” don’t make as much sense in SSMS query plans – those have a different orientation than the trees of database theory.
But if you rotate plans, you can see left and right that make sense (even if they still don’t match canonical shapes). Just follow the join operators.
The great thing about left deep plans is that only a couple of hash tables need to exist at a time, saving memory.
Whereas a right-deep hash join plan needs to keep all the hash tables around until it’s done.
So why would SQL Server ever choose a right-deep plan? There are two key ideas here. First, each shape has just one privileged position: a single table that only takes part in probes, never needing to fill up a hash table, never adding to the memory overhead. Review the two plans above – can you spot which table in each never contributes to a build?
Second, SQL Server has to pick just two tables to start with (joining more than two at once would be an n-ary join, which exists in theory but not in SQL Server’s plans). Certain join patterns let you choose any table to start with, while others severely limit the choice. Star joins, a common data warehouse pattern, require the central table to be part of the initial join.
I had a vague intuition of this part, but drawing it out really helped me – here are three basic patterns I tend to see:
First, an N-ary join, where all tables join on an equivalent value.
SELECT *
FROM Users A
JOIN BannedUsers B
ON A.UserID = B.UserID
JOIN Donors C
ON A.UserID = C.UserID
JOIN Achievements D
ON C.UserID = D.UserID
Then there are what I call chained joins (I assume there’s an official name, but I don’t know it).
SELECT *
FROM Users A
JOIN Vacations B
ON A.UserID = B.UserID
JOIN Itineraries C
ON B.VacationID = C.VacationID
JOIN Destinations D
ON C.DestinationID = D.DestinationID
Finally, and most importantly for this topic, there are star joins, where everything relates back to a central table.
SELECT *
FROM UserFacts A
JOIN NameDim B
ON A.NameID = B.NameID
JOIN AddressDim C
ON A.AddressID = C.AddressID
JOIN DemographicsDim D
ON A.DemographicsID = D.DemographicsID
In a star join, that central table is usually massive. Billions of rows into a hash table is typically a Bad Idea. And since all joins involve that table, SQL Server can’t delay it until the privileged position of a left deep plan. Instead, there will be a right deep hash join plan, and the massive table will be placed in the privileged position that never fills up a hash table.
So right-deep hash join plans exist because sticking gajillions of rows into a hash build tends to go poorly, and the optimizer picks a right deep plan to avoid blowing out your tempdb. Hurray for the optimizer!…this time.
But wait, where did that Top operator come from? Oh, SQL Server, you clever rascal…
Simple and extremely effective – a solid query transformation. But can you handle more complex versions, SQL Server?
Let’s try the bane of ordered scans: partitioned tables.
Please note that all repro scripts are shared at the bottom
Partitioning
The transformation works if the column is both the leading index key and the partitioning key.
But fails if the leading key is not the partitioning key.
I guess that makes sense to me, because it would have to grab the top value from each partition then sort those, a much more complicated transformation.
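You can spell that more complicated transformation out by hand if you really need the fast path – here’s the usual per-partition trick, assuming the version of the repro table that’s partitioned on Number2 (a sketch, and no promises your version of the optimizer turns each inner MAX into a tidy backward seek):
SELECT MAX(pmax.partition_max) AS overall_max
FROM sys.partitions p
CROSS APPLY (
    SELECT MAX(Number1) AS partition_max --max of the leading key within one partition
    FROM dbo.PartitionedNumbers
    WHERE $PARTITION.pf_hundreds(Number2) = p.partition_number
) pmax
WHERE p.object_id = OBJECT_ID('dbo.PartitionedNumbers')
AND p.index_id <= 1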
Second Column
Can SQL Server use the Top/Max transformation when looking at the second column of an index? In theory it should work.
Time for our favorite DBA sing-along song, It Depends.
Here’s the table, clustered on c1, c2, loaded with some filler data as seen in the repro script at the end.
CREATE TABLE dbo.A(
c1 INT NOT NULL,
c2 INT NOT NULL,
INDEX CX_somecreativename CLUSTERED (c1,c2)
)
The simplest form works
SELECT MAX(c2)
FROM dbo.A
WHERE c1 = 3
What if we make it more complicated? Let’s add a basic temp table.
CREATE TABLE #3 (three INT NOT NULL)
INSERT #3 VALUES (3)
This works…
SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) 3 FROM #3)
But this doesn’t
SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) three FROM #3)
Spot the difference? Pulling a column value from the second table breaks the transformation, but pulling a constant keeps it. Weird. Anyways, let’s try an (inexact) rewrite of the query.
SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
SELECT MAX(c2) AS the_max
FROM dbo.A
WHERE A.c1 = s.c3
) x
Victory again! Wait, what’s that? A change to the table definition making the second column nullable?
ALTER TABLE dbo.A
ALTER COLUMN c2 INT NULL
Ok fine, I’m sure the last query won’t…oh.
So the TOP/MAX transformation is finicky on a second column, depending on nullability [is that a word?] and how the query is written.
Going Backwards
There’s even a situation where SQL Server should (should!) be able to do a reverse transformation, a TOP into a MAX. Columnstore indexes store maximum values per segment, so a TOP(1) query could simply grab that!
Conclusion
I can’t tell you how tempted I was to call this MAXTOP. Thinking about it still causes stray giggles to bubble up, as I gleefully envision sharing this info with eager young DBAs at a SQL Saturday, then ushering them off to a session on parallelism. Thankfully I’m not that evil.
The real conclusion is that SQL Server is programmed to be very clever. Despite the cleverness though, details matter, and the cleverness often falls short. What’s that warm glow I feel inside? Ah yes, job security.
Repro scripts
USE TestDB
GO
CREATE PARTITION FUNCTION pf_hundreds(int)
AS RANGE RIGHT FOR VALUES (100,200,300,400,500,600,700,800,900);
CREATE PARTITION SCHEME ps_hundreds
AS PARTITION pf_hundreds
ALL TO ( [PRIMARY] );
GO
DROP TABLE IF EXISTS PartitionedNumbers
GO
CREATE TABLE PartitionedNumbers (
Number1 INT NOT null,
Number2 INT NOT NULL,
CONSTRAINT PK_numnum PRIMARY KEY CLUSTERED(Number1,Number2))
ON ps_hundreds(Number1) --drop/create table partitioned on Number2 for second query
INSERT PartitionedNumbers
SELECT TOP(999) ROW_NUMBER() OVER(ORDER BY 1/0), ROW_NUMBER() OVER(ORDER BY 1/0)
FROM master.dbo.spt_values
SELECT MAX(Number1)
FROM PartitionedNumbers
USE TestDB
GO
DROP TABLE IF EXISTS dbo.A
DROP TABLE IF EXISTS #3
go
CREATE TABLE dbo.A(
c1 INT NOT NULL,
c2 INT NOT NULL,
INDEX CX_somecreativename CLUSTERED (c1,c2)
)
INSERT dbo.A
SELECT TOP(100000)
(ROW_NUMBER() OVER(ORDER BY 1/0))%100,
(ROW_NUMBER() OVER(ORDER BY 1/0))%500 + 250
FROM sys.columns a, sys.columns b, sys.columns c
CREATE TABLE #3 (three INT NOT NULL)
INSERT #3 VALUES (3)
--works
SELECT MAX(c2)
FROM dbo.A
WHERE c1 = 3
--works
SELECT MAX(c2)
FROM dbo.A
WHERE c1 = (SELECT 3)
--works
SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) 3 FROM #3)
--does not work
SELECT MAX(c2)
FROM A
WHERE c1 = (SELECT TOP(1) three FROM #3)
--works
SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
SELECT MAX(c2) AS the_max
FROM dbo.A
WHERE A.c1 = s.c3
) x
--change things around
ALTER TABLE dbo.A
ALTER COLUMN c2 INT NULL
--does not work
SELECT the_max
FROM (SELECT TOP(1) three AS c3 FROM #3) s
CROSS APPLY (
SELECT MAX(c2) AS the_max
FROM dbo.A
WHERE A.c1 = s.c3
) x
Ok, I get it, scheduling queries can be complicated. See this and this and maybe this and this too but only if you have insomnia. I still thought I kinda understood it. Then I started seeing hundreds of query timeouts on a quiet server, where the queries were waiting on…what? CPU? The server’s at 20% – why would a query time out after a minute, while only getting 5s of CPU, and not being blocked?
It was Resource Governor messing up one of my servers recently in a way I never expected. Apparently RG will try to balance *total* CPU usage for pools, to the point that a single query sharing a scheduler will be completely starved.
Uh, this might be easier to explain with a picture:
And a demo of course:
Let’s set up some Resource Pools with no constraints. (Thanks to this blog for making some easy-to-copy scripts). Again, these pools don’t limit anything – the only thing they do is exist. Behold, pools Apple and Banana:
USE master
GO
CREATE RESOURCE POOL [Apples] WITH(
min_cpu_percent=0,
max_cpu_percent=100,
min_memory_percent=0,
max_memory_percent=100,
AFFINITY SCHEDULER = AUTO
)
GO
CREATE RESOURCE POOL [Bananas] WITH(
min_cpu_percent=0,
max_cpu_percent=100,
min_memory_percent=0,
max_memory_percent=100,
AFFINITY SCHEDULER = AUTO
)
GO
CREATE WORKLOAD GROUP [Apple]
USING [Apples]
GO
CREATE WORKLOAD GROUP [Banana]
USING [Bananas]
GO
CREATE FUNCTION dbo.Fruit_Inspector()
RETURNS SYSNAME WITH SCHEMABINDING
AS
BEGIN
DECLARE @workload_group sysname;
IF (program_name() LIKE 'A%')
SET @workload_group = 'Apple';
IF (program_name() LIKE 'B%')
SET @workload_group = 'Banana';
RETURN @workload_group;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.Fruit_Inspector);
GO
ALTER RESOURCE GOVERNOR RECONFIGURE
GO
Now for a little bit of SSMS trickery: you can use connection properties to set an application name of your choice, assigning your connection to a resource pool.
Check which scheduler you’re on (and your session id, workload group, and app name) with this query:
SELECT er.scheduler_id, wg.name AS wg_name, er.session_id, es.program_name
FROM sys.dm_exec_requests er
JOIN sys.resource_governor_workload_groups wg
ON wg.group_id = er.group_id
JOIN sys.dm_exec_sessions es
ON er.session_id = es.session_id
WHERE er.session_id = @@SPID
Get a session ready for Resource Pool Apple – it doesn’t matter what scheduler it’s on. Place an infinite-loop CPU-sucking query in it, ready to go. I don’t start it until later.
DECLARE @trash bigint
WHILE 1=1
BEGIN
SELECT @trash = COUNT(*) FROM dbo.Votes
end
Now get sessions for Pools Apple and Banana that happen to land on an identical scheduler (I did this by opening a bunch of Apple sessions, then reconnecting the Banana session until I got a match).
Start the other infinite-loop query, then run a high-CPU query on each of the colliding sessions – here’s what I used:
SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a
OPTION(MAXDOP 1,USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140'))
Remember we have two queries running from Pool Apple, and one from Pool Banana, where the Banana query is on the same scheduler as an Apple query. Here’s the Banana runtime – 15s of pure CPU:
But here’s the Apple runtime:
Apple took 30s because it wasn’t allowed *any* CPU time while Banana was running. Because SQLOS was trying to balance the resource pools with the infinite query running, the Apple query was waiting on SOS_SCHEDULER_YIELD for the entire duration of the Banana query, which I do think is absolutely bananas.
You’ll even see super-long scheduler yield waits with this pattern.
Yes, seriously, waiting 5 seconds to get time on the CPU.
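If you want to watch the balancing act from the outside, the pool-level counters are enough to see one pool being held back while the other catches up:
SELECT name, total_cpu_usage_ms
FROM sys.dm_resource_governor_resource_pools
WHERE name IN ('Apples','Bananas')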
Please note this doesn’t even need to be a heavy query from Pool A:
In summary, have the doodle again:
In other summary, Resource Governor is scary, I blame it for hundreds of avoidable timeouts, and I hope MS improves their documentation.
P.S. If you followed my demo, that infinite loop is still running – might wanna kill it. Even worse, RG is still around, which you can fix with the below:
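Something along these lines should do it – disconnect the Apple/Banana sessions first:
USE master
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = NULL)
ALTER RESOURCE GOVERNOR RECONFIGURE
DROP WORKLOAD GROUP [Apple]
DROP WORKLOAD GROUP [Banana]
DROP RESOURCE POOL [Apples]
DROP RESOURCE POOL [Bananas]
ALTER RESOURCE GOVERNOR RECONFIGURE
ALTER RESOURCE GOVERNOR DISABLE
DROP FUNCTION dbo.Fruit_Inspector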
A failover that needs five minutes for the secondary database to come online is a problem, especially if it generates stack dumps. Thankfully, Microsoft has Top Men, and with their help digging into source code, we uncovered…well, something neither of us had ever seen before. I’m naming them Zombie Transactions, because then I can pretend my job is more glamorous than clicking at a screen all day.
Why the Long Failover?
Every time there’s a failover, SQL Server must run the recovery process. The standard Analysis, Redo, and Undo occur, and even sometimes an extra step. In our case, it was Redo taking a ridiculously long time. According to XE, SQL Server was searching through the log for 25,000 active transactions every recovery (and incidentally producing stack dumps because this took so long).
Except these transactions weren’t really live. They were dead, shambling about our system, eating our performance. They didn’t come from normal activity, they didn’t show up in DMVs, and they couldn’t be killed with a failover.
I eventually found where they were lurking by using fn_dblog – they were in our checkpoints. Every checkpoint needs to include active transactions – it’s part of the design that allows a checkpoint to occur without freezing your whole database. And for some reason, our database was including this horde of undead transactions in each checkpoint.
Unresolved Non-DTC Cross Database…What?
This is where Microsoft support was helpful (you know, Top support, not the level 1s you have to put up with for the first month of an issue). The checkpoint entries were identified as “unresolved non-DTC cross-database transactions.” This awkward tongue-twister is the other reason I’m calling them zombies.
In other words, these are leftovers from transactions spanning multiple databases with an AG (for example, a trigger that logs changes to a separate non-AG database). If the transaction doesn’t complete correctly, it can leave an unresolved zombie behind. A minimal repro succeeded at generating a couple with a mid-transaction failover, though we never figured out how exactly 25k came about.
Zombie Hunting
I got tired of checking manually, so I wrote a script to search for zombies. It uses fn_dblog, which is undocumented and comes with caveats. I recommend running on a secondary.
USE [MyDB]
GO
CHECKPOINT
go
DECLARE @base bigint
declare @desc table (descr varchar(500))
select top(1) @base = CONVERT(bigint,CONVERT(varbinary(6),'0x'+REPLACE([Transaction ID],':',''),1))
from sys.fn_dblog(null,null)
where [Transaction ID] <> '0000:00000000'
insert @desc
select Description
from sys.fn_dblog(null,null)
where Operation = 'LOP_XACT_CKPT'
select COUNT(case when xacts.xval < @base then 1 end) as zombies, COUNT(*) as total
from (
select CONVERT(bigint,CONVERT(varbinary(6),'0x'+x.value,1)) xval
from (
select REPLACE(REPLACE(descr,'XdesIDs: ',''),':','') xdes
from @desc
) l
cross apply STRING_SPLIT(l.xdes,',') x
) xacts
where xval <> 0
How to Fix [Everything with Computers]
There was some back and forth with Microsoft on how to remove these – I didn’t want to edit memory, and restarting the server didn’t help. However, I quickly discovered that setting the database offline then online fixed it. That’s right, turning it off then on again was a solution.
After some more experimenting, we figured out that setting the database to Read-Only then back would fix it, and was faster than offline/online. A teammate also discovered that reseeding is necessary, since the zombies would reappear if you fail over. He built an awesome script for doing this, and I’m sure he would be willing to help for entirely reasonable rates™. Our failover time went from five minutes to ten seconds after we cleared the zombies.
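For reference, the flip itself is just this (take the database out of the AG first – hence the reseeding – and expect an outage either way):
--the faster option: read-only and back
ALTER DATABASE [MyDB] SET READ_ONLY WITH ROLLBACK IMMEDIATE
ALTER DATABASE [MyDB] SET READ_WRITE WITH ROLLBACK IMMEDIATE
--or the original off-and-on-again: offline then online
ALTER DATABASE [MyDB] SET OFFLINE WITH ROLLBACK IMMEDIATE
ALTER DATABASE [MyDB] SET ONLINE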
Summary
You might have zombie transactions if you’ve been running an AG without the DTC setting. If you do, and there are a lot of them, they can slow down failovers. To kill them off, you’ll have to set the database to read-only and back, or take it offline and online. Yes, this requires downtime.
Please let me know if you find them in your system – seriously, let me know, and let Microsoft know. That way there’s a chance of us getting a way to fix without needing downtime.