“It’s a very short blocking phase” they said. “Metadata only” they said.
Meanwhile, here’s my online rebuild of a zero row table.
Yes, it’s been running for 30 seconds, blocking the whole time. For extra fun – it’s unkillable.
Findings first: SQL Server stores details about missing index suggestions. The DMV pertinent to screwing up index operations is sys.dm_db_missing_index_group_stats_query. Should this DMV accumulate a very large number of entries, any index create/rebuild on any table with a missing index hint will take extra time to process those entries. Yes, that’s any table anywhere – it doesn’t matter if it’s in a separate database with a single hinted query.
Troubleshooting
The prod version of this issue manifested as 1-3 minutes of SCH-M blocking for index ops, yet only on a single server; I couldn’t replicate it anywhere else. Of course, bad blocking meant nasty timeouts, and that in turn meant I was reluctant to run live diagnostics. Better to just stop index rebuilds. One does not take down production in order to test weird SQL behavior.
An important clue came from the rebuild of that tiny table, and it’s visible in the whoisactive screenshot above. Why would such a small rebuild need so much CPU? That worker was definitely doing something – the question was what.
My lucky break came with a bad proc that needed emergency index help – in this case the fix was worth the blocking, and the table was non-critical anyway. I readied the index creation script and prepared an extended event to capture callstacks on waits and spinlock backoffs for my spid.
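For reference, the event session looked something like this – a minimal sketch, where the session name, target file name, and session_id are placeholders rather than the exact definition I ran:

CREATE EVENT SESSION WhatIsMySpidDoing ON SERVER
ADD EVENT sqlos.wait_info
(
    ACTION (package0.callstack, sqlserver.session_id)
    WHERE sqlserver.session_id = 53   --placeholder: the spid running the index operation
),
ADD EVENT sqlos.spinlock_backoff
(
    ACTION (package0.callstack, sqlserver.session_id)
    WHERE sqlserver.session_id = 53
)
ADD TARGET package0.event_file (SET filename = N'WhatIsMySpidDoing')
GO
--depending on your version, extra setup (symbols/trace flags) may be needed to turn raw addresses into readable frames
ALTER EVENT SESSION WhatIsMySpidDoing ON SERVER STATE = START
GO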
And…jackpot. The worker was hitting SOS_SCHEDULER_YIELD (SOSY) waits, and for 3 minutes every callstack was topped with:
sqllang!CQSIndexStatsMgr::DropIndexStatsWithWildcards
sqllang!CQSIndexRecMgr::InvalidateMissingIndexes
sqllang!QPEvent::CreateIndex
sqllang!CIndexDef::CreateIndex
Missing indexes, huh? I started poking around the DMVs, and discovered that sys.dm_db_missing_index_group_stats_query took several minutes to query – it was even unkillable, just like the index rebuilds.
Repro
I finally had enough clues to cause the issue on purpose. I’m going to show you how to do this at home, because breaking SQL Server is fun…in a test environment.
First, let’s create a table. No rows necessary, just tell SQL Server that there are.
drop table if exists dbo.RebuildTest

create table dbo.RebuildTest
(
    ID int identity primary key,
    Num int,
    Filler char(100)
)

UPDATE STATISTICS dbo.RebuildTest WITH ROWCOUNT = 1000000, PAGECOUNT = 10000
Now we need to generate missing index hints…lots of them. After some experimenting, I combined two features known to play nicely with caching: temp tables and dynamic SQL. (Dear junior DBA reading this: the prior sentence is sarcasm. Those features are notorious for caching issues.)
SET NOCOUNT ON
DECLARE @sql NVARCHAR(4000)
DECLARE @stop DATETIME = DATEADD(MINUTE,60,GETDATE())

WHILE GETDATE() < @stop
begin
    SET @sql = 'select * into #t'+CONVERT(CHAR(36),CONVERT(BINARY(16),NEWID()),1)+' from dbo.RebuildTest where Num = (select -1)'
    EXEC(@sql)
end
I ran that from ten sessions for a really long time – I think I accumulated about 15 core hours of spam. Rinse and repeat as needed to get the desired slowdown on your own system.
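If you want to gauge how much junk you’ve piled up along the way, a straight count of the DMV works – just keep in mind that, as noted above, this query itself gets painfully slow (and unkillable) once the DMV is bloated:

--simple progress check; slows down as the DMV grows
SELECT COUNT(*) AS missing_index_query_rows
FROM sys.dm_db_missing_index_group_stats_query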
Consequences of the above pollution:
Let’s create an unrelated table, also zero rows.
drop table if exists dbo.OtherTable

create table dbo.OtherTable
(
    ID int identity primary key,
    Num int,
    Filler char(100)
)

UPDATE STATISTICS dbo.OtherTable WITH ROWCOUNT = 1000000, PAGECOUNT = 10000
If you rebuild now, it will be super fast.
Run a single query that generates a missing index hint, however…
select * from dbo.OtherTable where Num = (select 42)
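If you want to double-check that the hint actually registered before kicking off the rebuild, something like this against the details DMV will show it (a sanity check, not part of the original repro):

--confirm a missing index suggestion now exists for dbo.OtherTable
SELECT mid.statement, mid.equality_columns, mid.inequality_columns, mid.included_columns
FROM sys.dm_db_missing_index_details mid
WHERE mid.database_id = DB_ID()
  AND mid.object_id = OBJECT_ID('dbo.OtherTable')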
And your rebuild transforms into a lumbering, unkillable monstrosity:
ALTER INDEX ALL ON dbo.OtherTable REBUILD WITH(ONLINE=ON,MAXDOP=1)

--need to tell SQL fake stats again if you want to repeat the tests
UPDATE STATISTICS dbo.OtherTable WITH ROWCOUNT = 1000000, PAGECOUNT = 10000
Resolution
Ironically enough, this can be fixed by, ugh, restarting SQL Server, adding the appropriate index, or rebuilding the table behind the missing index spam. Due to the nature of the pollution, none of the SQL handles from the DMV resolved to text, but I was able to work up a way to see which objects were behind the growth.
SELECT TOP 10
    mid.database_id,
    mid.object_id,
    COUNT(*) c
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig
    ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats_query migs
    ON migs.group_handle = mig.index_group_handle
GROUP BY mid.database_id, mid.object_id
ORDER BY c desc
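Applied to the repro tables above, the “add the appropriate index” option boils down to indexing the hinted column – the index names here are mine, and the create itself will likely eat that same processing hit one more time on the way through:

--index names are mine; the hinted queries in the repro filter on Num
CREATE INDEX IX_RebuildTest_Num ON dbo.RebuildTest (Num)
CREATE INDEX IX_OtherTable_Num ON dbo.OtherTable (Num)

--or take the other route and rebuild the polluted table's indexes
ALTER INDEX ALL ON dbo.RebuildTest REBUILD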
There’s another detail I haven’t figured out: on my production system, the DMV had 3 million entries and was adding minutes of extra time. On my laptop, 9 million entries added about 45 seconds. I’m kinda curious where the discrepancy comes from, but not enough that I’m willing to spend hours melting my personal machine or digging through disassembly. Besides, disassembly problems belong to Paul White, the magical SQL hermit on a remote island who solves weird code issues for booze, and who I secretly aspire to be if I grow up.
It’s probably a rare issue to hit as well – you need a server that hasn’t been restarted in a long time, plus a steady stream of unique queries that each generate a hint without getting combined with similar SQL. I mean, how ridiculous would it be for people to dynamically fill global temp tables named with guids? Nobody does that, right? Crazytown.
But this is the way of things – do weird stuff in SQL, get weird issues. At least weird issues lead to entertaining blog posts, which I sincerely hope does not incentivize you to keep doing the weird stuff. Especially Service Borker. Seriously, knock that out.
Just had a look at the documentation for sys.dm_db_missing_index_group_stats_query and it says: “The result set for this DMV is limited to 600 rows.” And then I found this: https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit/issues/3085#issuecomment-1127563141 – so this might be a bug. Has anyone opened a support case with Microsoft? I would be interested in the results.