Missing Index Hints and the Abominable Rebuild

“It’s a very short blocking phase,” they said. “Metadata only,” they said.

Meanwhile, here’s my online rebuild of a zero-row table.

What database doing?

Yes, it’s been running for 30 seconds, blocking the whole time. For extra fun, it’s unkillable.

Findings first: SQL Server stores details about missing index suggestions. The DMV pertinent to screwing up index operations is sys.dm_db_missing_index_group_stats_query. Should this DMV accumulate a very large number of entries, any index create/rebuild on any table with a missing index hint will take extra time to process those entries. Yes, that’s any table anywhere – doesn’t matter if it’s in a separate database with a single hinted query.

Troubleshooting

The prod version of this issue manifested as 1-3 minutes of SCH-M blocking for index ops, but only on a single server; I couldn’t replicate it anywhere else. Of course, bad blocking meant nasty timeouts, and that in turn meant I was reluctant to run live diagnostics. Better to just stop index rebuilds. One does not take down production in order to test weird SQL behavior.
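
Spotting the pileup itself is easy enough – sp_whoisactive shows it, or a quick peek at the lock DMV does the trick. A minimal sketch:

-- anything sitting in a WAIT state, and what it's waiting for;
-- a convoy of waiters stacked behind a Sch-M request is the telltale sign
SELECT tl.request_session_id, tl.resource_type, tl.request_mode, tl.request_status
FROM sys.dm_tran_locks tl
WHERE tl.request_status = 'WAIT'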

An important clue came from the rebuild of a tiny table, visible in the whoisactive screenshot above. Why would such a small rebuild need so much CPU? That worker was definitely doing something – the question was what.

My lucky break came with a bad proc that needed emergency index help – in this case the fix was worth the blocking, and the table was non-critical anyway. I readied the index creation script and prepared an extended event session to capture callstacks on waits and spinlock backoffs for my spid.
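
The session looked roughly like this – a sketch, with placeholder session name, file target, and spid. Fair warning: callstack collection isn’t free, so keep it pointed at a single spid, and the raw addresses won’t resolve to readable function names without trace flag 3656 plus public symbols (or a post-processing tool like sqlcallstackresolver).

-- sketch only: name, file target, and spid 57 are placeholders
CREATE EVENT SESSION [callstacks_for_my_spid] ON SERVER
ADD EVENT sqlos.wait_info (
	ACTION (package0.callstack, sqlserver.session_id)
	WHERE (sqlserver.session_id = 57)
),
ADD EVENT sqlos.spinlock_backoff (
	ACTION (package0.callstack, sqlserver.session_id)
	WHERE (sqlserver.session_id = 57)
)
ADD TARGET package0.event_file (SET filename = N'callstacks_for_my_spid');

ALTER EVENT SESSION [callstacks_for_my_spid] ON SERVER STATE = START;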

And…jackpot. The worker was hitting SOSY (SOS_SCHEDULER_YIELD) waits, and for 3 minutes every callstack was topped with:

sqllang!CQSIndexStatsMgr::DropIndexStatsWithWildcards
sqllang!CQSIndexRecMgr::InvalidateMissingIndexes
sqllang!QPEvent::CreateIndex
sqllang!CIndexDef::CreateIndex

Missing indexes, huh? I started poking around the DMVs and discovered that sys.dm_db_missing_index_group_stats_query took several minutes to query – it was even unkillable, just like the index rebuilds.

Repro

I finally had enough clues to cause the issue on purpose. I’m going to show you how to do this at home, because breaking SQL Server is fun…in a test environment.

First, let’s create a table. No rows necessary, just tell SQL Server that there are.

drop table if exists dbo.RebuildTest

create table dbo.RebuildTest (
	ID int identity primary key,
	Num int,
	Filler char(100)
)

UPDATE STATISTICS dbo.RebuildTest WITH ROWCOUNT = 1000000, PAGECOUNT = 10000

Now we need to generate missing index hints…lots of them. After some experimenting, I combined two features known to play nicely with caching: temp tables and dynamic SQL. (Dear junior DBAs reading this: that was sarcasm. Both features are notorious for causing caching issues.)

SET NOCOUNT ON

DECLARE @sql NVARCHAR(4000)
DECLARE @stop DATETIME = DATEADD(MINUTE,60,GETDATE())

WHILE GETDATE() < @stop
begin
	
	-- the GUID (rendered as hex) in the temp table name makes every statement text unique,
	-- so each execution adds another row to the per-query missing index DMV
	SET @sql =
	'select *
	into #t'+CONVERT(CHAR(36),CONVERT(BINARY(16),NEWID()),1)+'
	from dbo.RebuildTest
	where Num = (select -1)'

	EXEC(@sql)

end

I ran that from ten sessions for a really long time – I think I accumulated about 15 core hours of spam. Rinse and repeat as needed to get the desired slowdown on your own system.

Consequences of the above pollution:

Over nine million!
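
That’s the row count of the per-query missing index DMV. If you want to check the damage on your own instance, a straight count does the trick (fair warning: once the DMV is bloated, even this query crawls, which is rather the point):

SELECT COUNT(*) AS entries
FROM sys.dm_db_missing_index_group_stats_query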

Let’s create an unrelated table, also zero rows.

drop table if exists dbo.OtherTable

create table dbo.OtherTable (
	ID int identity primary key,
	Num int,
	Filler char(100)
)

UPDATE STATISTICS dbo.OtherTable WITH ROWCOUNT = 1000000, PAGECOUNT = 10000

If you rebuild now, it will be super fast.
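
For concreteness, this is the rebuild I’m timing – the same statement we’ll run again in a moment (note that ONLINE = ON is an Enterprise/Developer edition feature):

ALTER INDEX ALL ON dbo.OtherTable REBUILD WITH(ONLINE=ON,MAXDOP=1)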

Run a single query that generates a missing index hint, however…

select *
from dbo.OtherTable
where Num = (select 42)

And your rebuild transforms into a lumbering, unkillable monstrosity.

ALTER INDEX ALL ON dbo.OtherTable REBUILD WITH(ONLINE=ON,MAXDOP=1)

--need to tell SQL fake stats again if you want to repeat the tests
UPDATE STATISTICS dbo.OtherTable WITH ROWCOUNT = 1000000, PAGECOUNT = 10000
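
If you’d like to watch the rebuild suffer from another session while it runs, sys.dm_exec_requests shows the CPU piling up (the spid below is a placeholder – use whichever session is running the ALTER INDEX):

SELECT r.session_id, r.status, r.command, r.wait_type, r.cpu_time, r.total_elapsed_time
FROM sys.dm_exec_requests r
WHERE r.session_id = 52 -- placeholder: the spid running the rebuild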

Resolution

Ironically enough, this can be fixed by, ugh, restarting SQL Server, adding the appropriate index, or rebuilding the table behind the missing index spam. Due to the nature of the pollution, none of the SQL handles from the DMV resolved to text, but I was able to work up a way to see which objects were behind the growth.

SELECT TOP 10 mid.database_id, mid.object_id, COUNT(*) c
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig
ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats_query migs
ON migs.group_handle = mig.index_group_handle
GROUP BY mid.database_id,
         mid.object_id
ORDER BY c desc
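
The raw IDs aren’t exactly friendly, so if you want names, DB_NAME() and the two-argument OBJECT_NAME() overload will translate them (assuming the objects still exist):

SELECT TOP 10 DB_NAME(mid.database_id) AS database_name,
       OBJECT_NAME(mid.object_id, mid.database_id) AS object_name,
       COUNT(*) AS c
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig
ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats_query migs
ON migs.group_handle = mig.index_group_handle
GROUP BY mid.database_id, mid.object_id
ORDER BY c DESC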

There’s another detail I haven’t figured out: on my production system, the DMV had 3 million entries and was adding minutes of extra time, while on my laptop, 9 million entries cost about 45 seconds. I’m kinda curious where the discrepancy comes from, but not enough to spend hours melting my personal machine or digging through disassembly. Besides, disassembly problems belong to Paul White, the magical SQL hermit on a remote island who solves weird code issues for booze, and who I secretly aspire to be if I grow up.

It’s probably a rare issue to hit as well, needing a server that hasn’t been restarted in a good long while and unique queries that both generate a hint and don’t get combined with similar SQL. I mean, how ridiculous would it be for people to dynamically fill global temp tables named with GUIDs? Nobody does that, right? Crazytown.

But this is the way of things – do weird stuff in SQL, get weird issues. At least weird issues lead to entertaining blog posts, which I sincerely hope does not incentivize you to keep doing the weird stuff. Especially Service Borker. Seriously, knock it off.