November 2018 – Forrest Shares Stuff

Who knew cascading deletes could be so horribly entertaining (emphasis on horrible)? The problems I wrote about traced back to cardinality, but I wasn’t ready to write about it. So, uh, I might have spent most of a perfectly good weekend reading Dmitry and Microsoft patents.

The scan plan was coming from a really bad cardinality assumption: given a Parent ID sufficiently outside the stats histogram, *every* Child table row would join to it. But why was I getting such a ridiculous estimate?

Time for an undocumented trace flag (gosh, I love these)! Dmitry’s post introduced me to trace flag 2363, which outputs cardinality estimation info for the new estimator. I tossed in a QUERYTRACEON, and got the following.

Coooooooool! The optimizer considers stats on Parent.ID to be a badly-formed histogram, whatever that is. Moreover, it just gives up, moves to CSelCalcGuessComparisonJoin, and gives selectivity of 1 (aka everything). Nice guess.

What happens when the ID is within range?

Some huge amount of testing later, I’m able to determine that the bad bailout only occurs when 1) Parent stats are “badly-formed” and 2) the ID is out of range. Stats updated with a FULLSCAN aren’t badly-formed, which seems to be why the problem disappears in that circumstance.

Also interesting is what happens in 2014 compatibility with QO Fixes off.

It ejects, but uses CSelCalcSimpleJoin, with a much more reasonable guess.

I kind of have my answer at this point, which that the reason for crappy cardinality is buried deep within the proprietary bowels of SQL Server’s estimator design. Luckily there was a holiday sale on unsatisfying answers, and I got a new question for free.

What the heck is a “badly-formed histogram?”

Spelunking with windbg left me in the dark, so I resorted to empirical testing with stats donating. The best I can tell, a badly-formed histogram only occurs for a column that is unique and has decimal-valued range_rows. In turn, decimal-valued range_rows only seem to occur for large-enough tables with sampled statistics. How’s that for a summary of weeks of investigation?

To be honest, I’m not really happy with the state of my understanding here. My answers are more anecdote than explanation, only looking at a lower layer of turtle. Maybe if I booze up a Microsoft employee enough I can get them to share info outside of NDA…

Anyways, I’m off to read more patents.

I had an interesting conversation with a developer the other day – it went something like this:

Developer: “Why does it take three minutes to delete a single row from this table?”

Me: “Hold on while I look at the plan. Oh, it’s because SQL Server is checking the 460 foreign keys that reference that table, 447 of which don’t have a supporting index.”

I know the pains of foreign keys without indexes. Oh, do I know. So when I saw a cascading single-row delete plan with a scan on the child table, I thought, “Aha! I know how to fix this, an index!”

But the index was already there, a perfect one for supporting that foreign key with the abominable ON DELETE CASCADE . Something was obviously wrong, and I immediately suspected our recent switch to 2016 compatibility level – meaning a new cardinality estimator.

Sure enough, in 2012 mode the query optimizer produced a seek.

I was able to come up with a repro of the issue (which is what the above plans are from).

--set up two tables, one with a foreign key with cascading delete
DROP TABLE IF EXISTS dbo.Child
DROP TABLE IF EXISTS dbo.Parent
GO

CREATE TABLE dbo.Parent (
ID INT IDENTITY PRIMARY KEY CLUSTERED,
junk CHAR(1)
)

INSERT dbo.Parent
SELECT TOP 1000000 'a'
FROM master..spt_values a
CROSS JOIN master..spt_values b

CREATE TABLE dbo.Child (
ID INT IDENTITY PRIMARY KEY CLUSTERED,
morejunk CHAR(5),
ParentID INT
)

ALTER TABLE dbo.Child
ADD CONSTRAINT [💩] FOREIGN KEY (ParentID)
REFERENCES dbo.Parent (ID) 
ON DELETE CASCADE
ON UPDATE CASCADE

CREATE INDEX IX_agggghhhhhh ON dbo.Child(ParentID)

INSERT dbo.Child
SELECT TOP 1100000 'bah!', 42 --repro also works with random numbers for FK column 
FROM master..spt_values a
CROSS JOIN master..spt_values b
GO


-----------------------------------------------------------------
--Just need estimated plan here. Does it scan on Child?
DELETE dbo.Parent
WHERE ID = -1  --any value here too low or too high will cause a scan, often including values that do exist in Parent!

The short of it is, with any value too far outside of the histogram of the parent table, the new cardinality estimator assumes that EVERY SINGLE ROW of the child table will get deleted.

This can be seen even when a seek is forced through a USE PLAN hint.

Interestingly enough, running the new cardinality estimator in 2014 compatibility level with Query Optimizer Fixes off will also produce a seek plan. I found a few other ways to get the good plan, and stuck them in a bonus script.

DELETE dbo.Parent
WHERE ID = -1
OPTION(
USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION')
)

DELETE dbo.Parent
WHERE ID = -1
OPTION(
USE HINT('ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS')
)
--legacy CE with base containment (TF2301) still seeks though


--Repro is dependent on statistics
SELECT * FROM sys.dm_db_stats_properties_internal(OBJECT_ID('dbo.Parent'),1)
SELECT * FROM sys.dm_db_stats_histogram(OBJECT_ID('dbo.Parent'),1)
SELECT * FROM sys.dm_db_stats_histogram(OBJECT_ID('dbo.Child'),2)

UPDATE STATISTICS dbo.Parent WITH SAMPLE 5 PERCENT  --allows problem
UPDATE STATISTICS dbo.Parent WITH FULLSCAN  --changes histogram and prevents problem
UPDATE STATISTICS dbo.Parent WITH SAMPLE 99 PERCENT  --still allows problem

There’s something I don’t understand happening with the cardinality estimator here (as is usual). It seems when the value being deleted from the Parent table is within its histogram, the optimizer uses stats from the Child. Otherwise, it assumes the entire Child table matches. Ouch.

Remember how my demo only has two tables? The real scenario I encountered was a cascading delete hitting forty tables, and, you guessed it, scanning each one. Block party!

Forcing plans with the help of Query Store only goes so far, especially when it’s playing whack-a-mole with ORM-generated SQL. The best solution for now might be to retreat to 2012 compatibility and complain on the internet. Bah!

Month: November 2018

Cascading Deletes – Cardinality

Cascading Delete Troubles