Sparse Stats Surprise

For some reason there are a number of incredible SQL guys named Bob at Microsoft. My personal favorite is boB (yes, boB, like Bob, just spelled backwards). I had the privilege of meeting him, and besides receiving quality SQL advice from him, I got to see boB do some great magic tricks.

I couldn’t help but think of this SQL magician when I ran across this puzzler. The Cardinality Estimator was playing tricks on me, and I imagined boB with the ol’ ball-and-cups game standing behind the screen.

Let’s start with a 2M row table where IsRow is almost always 1, and TypeID is almost always zero. (Full repro script at the bottom of this post.)

And here’s an exact count of values.
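(If you're following along with the repro script at the bottom, a quick GROUP BY like this will show you the breakdown on your own copy.)

SELECT IsRow, TypeID, COUNT(*) AS cnt
FROM CEPuzzler
GROUP BY IsRow, TypeID
ORDER BY IsRow, TypeID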

Let’s check the cardinality estimate for IsRow = 1

SELECT COUNT(*)
FROM CEPuzzler
WHERE IsRow = 1
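(If you want to compare estimates against actuals yourself, the actual execution plan works, or you can go old-school with SET STATISTICS PROFILE, which puts EstimateRows and Rows side by side.)

SET STATISTICS PROFILE ON
SELECT COUNT(*)
FROM CEPuzzler
WHERE IsRow = 1
SET STATISTICS PROFILE OFF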

That’s a pretty good estimate. Let’s try for TypeID = 1

SELECT COUNT(*)
FROM CEPuzzler
WHERE TypeID = 1

Not perfect, but it’s still a small fraction of the size of the table, so I’d call it a good estimate.

What happens when we combine the two prior filters?

SELECT COUNT(*)
FROM CEPuzzler
WHERE IsRow = 1
AND TypeID = 1
What the heck?!

boB would execute a suave magician’s pose here, so, uh, imagine me flourishing a cape or something. Really though, I just let out a giant, “What the heck?!” when I found this. Somehow making the WHERE clause more restrictive made the row estimate increase.

Unlike boB, I’m going to tell you how the trick works. The short version (the one that doesn’t involve wading through trace flag 2363 output) boils down to density vectors and out-of-range stats.

There are so few TypeID = 1 values that they don’t get picked up in the stats sampling. Thus TypeID = 1 ends up with the same estimate as values that don’t exist at all: sqrt(2000000) ≈ 1414.

SELECT COUNT(*)
FROM CEPuzzler
WHERE TypeID = -99999
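(You can check whether sampling really did miss TypeID = 1 on your copy by peeking at the histogram half of the stats. If a 1 shows up under RANGE_HI_KEY, your sample got lucky and the trick falls apart.)

DBCC SHOW_STATISTICS('dbo.CEPuzzler', IX_CEPuzzler_TypeID_IsRow) WITH HISTOGRAM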

I have a couple indexes on the table, and one of them uses both TypeID and IsRow. When the current CE encounters an out-of-range value, it defaults to using the density vector, and since that estimate is basically density × table rows, a density of roughly 0.5 against 2,000,000 rows gives us the ridiculous estimate of half the table.

DBCC SHOW_STATISTICS('dbo.CEPuzzler',IX_CEPuzzler_TypeID_IsRow)

There are a couple of potential fixes here (and no, trace flag 4199 is not one of them). Perhaps the most interesting is to demote IsRow in that index from a key column to a mere included column. That keeps it out of the density vector, which leads to better estimates.
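If you want to try that version, it would look something like this (rebuilding the existing index in place, same name and all):

CREATE INDEX IX_CEPuzzler_TypeID_IsRow ON dbo.CEPuzzler(TypeID) INCLUDE(IsRow)
WITH (DROP_EXISTING = ON)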

But there’s a better solution:

SELECT COUNT(*)
FROM CEPuzzler
WHERE IsRow = 1
AND TypeID = 1
OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'))
#TeamLegacy

Abracadabra!

--repro script

CREATE TABLE CEPuzzler (
ID int identity primary key,
IsRow bit not null,
TypeID int not null,
junk char(100)
)

CREATE INDEX IX_CEPuzzler_IsRow ON dbo.CEPuzzler(IsRow)
CREATE INDEX IX_CEPuzzler_TypeID_IsRow ON dbo.CEPuzzler(TypeID,IsRow)

INSERT CEPuzzler
SELECT TOP (2000000)
1-(ROW_NUMBER() OVER(ORDER BY 1/0))%20001/20000, --almost all 1s
(ROW_NUMBER() OVER(ORDER BY 1/0))%400001/400000, --almost all 0s
'junk'
FROM master..spt_values a
CROSS JOIN master..spt_values b
CROSS JOIN master..spt_values c

--your instance may, through differently random sampling,
--pick up some of the rare values, giving you different results
UPDATE STATISTICS CEPuzzler WITH SAMPLE 1 PERCENT

--Here's a query closer to what I encountered in prod,
--showing why these bad stats matter.
--Try it with legacy CE, and see how much faster it is.
--Or just update stats with fullscan...whatever
SELECT junk
FROM CEPuzzler
WHERE IsRow = 1
AND TypeID = 1
--OPTION(USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'))
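--or, as mentioned above, just fix the stats themselves:
--UPDATE STATISTICS CEPuzzler WITH FULLSCAN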