Cascading Deletes – Cardinality

Who knew cascading deletes could be so horribly entertaining (emphasis on horrible)? The problems I wrote about traced back to cardinality, but I wasn’t ready to write about it. So, uh, I might have spent most of a perfectly good weekend reading Dmitry’s posts and Microsoft patents.

The scan plan was coming from a really bad cardinality assumption: given a Parent ID sufficiently outside the stats histogram, *every* Child table row would join to it. But why was I getting such a ridiculous estimate?

Time for an undocumented trace flag (gosh, I love these)! Dmitry’s post introduced me to trace flag 2363, which outputs cardinality estimation info for the new estimator. I tossed in a QUERYTRACEON, and got the following.

Coooooooool! The optimizer considers stats on Parent.ID to be a badly-formed histogram, whatever that is. Moreover, it just gives up, moves to CSelCalcGuessComparisonJoin, and gives selectivity of 1 (aka everything). Nice guess.
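For anyone who wants to see this kind of output themselves, here is a minimal sketch of attaching the trace flag to a single statement. The table name `dbo.Parent` and the ID value are placeholders, not the actual repro. Trace flag 3604 is added so the output goes to the Messages tab, and QUERYTRACEON normally requires sysadmin rights.

```sql
-- Hypothetical objects: dbo.Parent(ID), referenced by a cascading FK from a child table.
-- 3604 redirects trace output to the client; 2363 is the undocumented
-- cardinality estimation trace flag for the new estimator.
BEGIN TRANSACTION;

DELETE FROM dbo.Parent
WHERE ID = 2000000000   -- assumed to sit far above the highest histogram step
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, RECOMPILE);

-- Only the estimation output is of interest, so undo the delete.
ROLLBACK TRANSACTION;
```

The RECOMPILE hint is there because the trace output is produced at compile time; a cached plan would print nothing.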

What happens when the ID is within range?

Some huge amount of testing later, I’m able to determine that the bad bailout only occurs when 1) Parent stats are “badly-formed” and 2) the ID is out of range. Stats updated with a FULLSCAN aren’t badly-formed, which seems to be why the problem disappears in that circumstance.
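A quick way to check whether the Parent stats were sampled, and to rebuild them with a full scan, might look like the sketch below. `dbo.Parent` is a placeholder name; `sys.dm_db_stats_properties` is a documented function, but treat the rest as an illustration rather than the exact test script.

```sql
-- rows_sampled < rows means the statistics were built from a sample.
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.Parent');   -- hypothetical table name

-- Rebuild every statistic on the table from all rows instead of a sample.
UPDATE STATISTICS dbo.Parent WITH FULLSCAN;
```

Keep in mind that a later automatic stats update will go back to sampling, so a one-off FULLSCAN is not a permanent fix on its own.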

Also interesting is what happens in 2014 compatibility with QO Fixes off.

It ejects, but uses CSelCalcSimpleJoin, with a much more reasonable guess.
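To set up that configuration on a test database, something along these lines should work on SQL Server 2016 or later. This is a sketch under the assumption that trace flag 4199 is not enabled globally, since that would turn the optimizer fixes back on regardless of the database setting.

```sql
-- Run in the test database. Level 120 keeps the new cardinality estimator
-- but with SQL Server 2014 behavior.
ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 120;

-- OFF is the default; set explicitly here to make the test condition obvious.
ALTER DATABASE SCOPED CONFIGURATION SET QUERY_OPTIMIZER_HOTFIXES = OFF;

-- Confirm both settings before re-running the delete.
SELECT name, compatibility_level
FROM sys.databases
WHERE database_id = DB_ID();

SELECT name, value
FROM sys.database_scoped_configurations
WHERE name = N'QUERY_OPTIMIZER_HOTFIXES';
```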

I kind of have my answer at this point, which is that the reason for crappy cardinality is buried deep within the proprietary bowels of SQL Server’s estimator design. Luckily there was a holiday sale on unsatisfying answers, and I got a new question for free.

What the heck is a “badly-formed histogram?”

Spelunking with windbg left me in the dark, so I resorted to empirical testing with stats donating. As best I can tell, a badly-formed histogram only occurs for a column that is unique and has decimal-valued range_rows. In turn, decimal-valued range_rows only seem to occur for large-enough tables with sampled statistics. How’s that for a summary of weeks of investigation?
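One way to hunt for those fractional range_rows is to query the histogram directly. The sketch below assumes a placeholder table `dbo.Parent` and a build that has `sys.dm_db_stats_histogram` (SQL Server 2016 SP1 CU2 or later); on older builds, `DBCC SHOW_STATISTICS ... WITH HISTOGRAM` shows the same columns. It only finds non-integer steps; it does not prove the optimizer will call the histogram badly-formed.

```sql
-- List histogram steps whose range_rows value is not a whole number.
SELECT s.name AS stats_name,
       h.step_number,
       h.range_high_key,
       h.range_rows,
       h.distinct_range_rows
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE s.object_id = OBJECT_ID(N'dbo.Parent')   -- hypothetical table name
  AND h.range_rows <> FLOOR(h.range_rows)      -- fractional value
ORDER BY s.name, h.step_number;
```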

To be honest, I’m not really happy with the state of my understanding here. My answers are more anecdote than explanation, only looking at a lower layer of turtle. Maybe if I booze up a Microsoft employee enough I can get them to share info outside of NDA…

Anyways, I’m off to read more patents.

Cascading Delete Troubles

I had an interesting conversation with a developer the other day – it went something like this:

Developer: “Why does it take three minutes to delete a single row from this table?”

Me: “Hold on while I look at the plan. Oh, it’s because SQL Server is checking the 460 foreign keys that reference that table, 447 of which don’t have a supporting index.”
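A rough way to count this sort of thing is to look for foreign keys whose first referencing column is not the leading key of any index on the child table. This is a simplified sketch: `dbo.Parent` is a placeholder, and the check ignores multi-column keys beyond the first column, disabled indexes, and filtered indexes.

```sql
-- Foreign keys pointing at dbo.Parent with no index leading on the FK column.
SELECT fk.name AS foreign_key_name,
       OBJECT_SCHEMA_NAME(fk.parent_object_id) AS child_schema,
       OBJECT_NAME(fk.parent_object_id)        AS child_table,
       COL_NAME(fkc.parent_object_id, fkc.parent_column_id) AS child_column
FROM sys.foreign_keys AS fk
JOIN sys.foreign_key_columns AS fkc
  ON fkc.constraint_object_id = fk.object_id
 AND fkc.constraint_column_id = 1               -- first column of the key only
WHERE fk.referenced_object_id = OBJECT_ID(N'dbo.Parent')   -- hypothetical table name
  AND NOT EXISTS (
        SELECT 1
        FROM sys.index_columns AS ic
        WHERE ic.object_id   = fkc.parent_object_id
          AND ic.column_id   = fkc.parent_column_id
          AND ic.key_ordinal = 1                -- leading key column of some index
      )
ORDER BY child_schema, child_table;
```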

I know the pains of foreign keys without indexes. Oh, do I know. So when I saw a cascading single-row delete plan with a scan on the child table, I thought, “Aha! I know how to fix this, an index!”

But the index was already there, a perfect one for supporting that foreign key with the abominable ON DELETE CASCADE. Something was obviously wrong, and I immediately suspected our recent switch to 2016 compatibility level – meaning a new cardinality estimator.

Sure enough, in 2012 mode the query optimizer produced a seek.

I was able to come up with a repro of the issue (which is what the above plans are from).
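The general shape of a two-table setup like this is sketched below. All names are made up for illustration and this is not the actual repro script; in particular, it creates only the schema, and the tables would still need enough rows for statistics to be sampled before the behavior has a chance to show up.

```sql
-- Hypothetical parent table with a unique key.
CREATE TABLE dbo.Parent
(
    ID     int       NOT NULL CONSTRAINT PK_Parent PRIMARY KEY CLUSTERED,
    Filler char(100) NOT NULL CONSTRAINT DF_Parent_Filler DEFAULT ('x')
);

-- Hypothetical child table whose rows are removed when the parent row is deleted.
CREATE TABLE dbo.Child
(
    ID       int IDENTITY(1, 1) NOT NULL CONSTRAINT PK_Child PRIMARY KEY CLUSTERED,
    ParentID int NOT NULL
        CONSTRAINT FK_Child_Parent FOREIGN KEY REFERENCES dbo.Parent (ID)
        ON DELETE CASCADE
);

-- The supporting index that should allow a seek during the cascade.
CREATE NONCLUSTERED INDEX IX_Child_ParentID ON dbo.Child (ParentID);
```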

The short of it is, with any value too far outside of the histogram of the parent table, the new cardinality estimator assumes that EVERY SINGLE ROW of the child table will get deleted.

This can be seen even when a seek is forced through a USE PLAN hint.

Interestingly enough, running the new cardinality estimator in 2014 compatibility level with Query Optimizer Fixes off will also produce a seek plan. I found a few other ways to get the good plan, and stuck them in a bonus script.
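For reference, these are some general-purpose ways to push a single statement or a whole database back to the legacy estimator. They are not necessarily the options in the bonus script, and whether each one yields the seek plan in a given environment would need testing. `dbo.Parent` and the ID value are placeholders.

```sql
BEGIN TRANSACTION;

-- Statement-level: documented hint, available from SQL Server 2016 SP1.
DELETE FROM dbo.Parent
WHERE ID = 2000000000
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

ROLLBACK TRANSACTION;   -- testing the plan only, so undo the delete

-- Statement-level on older builds: trace flag 9481 (needs sysadmin rights).
--   OPTION (QUERYTRACEON 9481)

-- Database-level, keeping the 2016 compatibility level otherwise:
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;
```

The database-scoped setting is the least invasive of these for ORM-generated SQL, since it needs no change to the query text.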

There’s something I don’t understand happening with the cardinality estimator here (as is usual). It seems when the value being deleted from the Parent table is within its histogram, the optimizer uses stats from the Child. Otherwise, it assumes the entire Child table matches. Ouch.

Remember how my demo only has two tables? The real scenario I encountered was a cascading delete hitting forty tables, and, you guessed it, scanning each one. Block party!

Forcing plans with the help of Query Store only goes so far, especially when it’s playing whack-a-mole with ORM-generated SQL. The best solution for now might be to retreat to 2012 compatibility and complain on the internet. Bah!