Cardinality in a Shared Membership Query

Take a look at this. Pay attention to the estimated rows. See the problem?

Even at a full cartesian join, 49 by 49 rows does not produce 27670.

Usually, when the optimizer applies transformations it updates cardinality estimates, but I ran into a particular construction and transformation where it doesn’t (and I don’t know why).

Here’s the basic idea for my demo query. You have three tables, Students, Towns, and Fraternities. Each student has a hometown and a fraternity. For a given list of hometowns, what fraternities have members from those towns?

One possible way to write the query is like this:

--using a temp table to take the place of a TVP 
--as encountered in a real production scenario
SELECT TOP(2) TownId, TownName
INTO #t
FROM dbo.Towns
WHERE TownName LIKE '%Dallas%'

SELECT f.Fraternity, t.TownName
FROM #t t
CROSS JOIN Fraternities f
WHERE EXISTS (
	SELECT 1
	FROM dbo.Students s
	WHERE t.TownId = s.HometownId
	AND s.FraternityId = f.FraternityId
	)

If the cross join weirds you out, note that it was disguised when I first found it with a one-sided ON filter. E.g.

SELECT f.Fraternity, t.TownName
FROM #t t
JOIN Fraternities f --this is still a cross join
ON f.RandomColumn = @Param

This is the full plan produced, across two pictures. SQL Server doesn’t actually do a cross join, thankully, but starts with the small temp table to find existing Students with matching hometown.

It then uses the FraternityIds from these students to look up data in the Fraternity table, and finally aggregates the data to remove duplicates (to preserve the semi join requirements).

So What?

As encountered in my real production issue, the bad cardinality estimate caused the optimzer to shove inappropriately large tables up before the join, causing timeouts. Cartesian cardinalities can get pretty high, so what should have been a good, filtering join was getting delayed. And no, I didn’t bother with a demo for this part, sorry.

A Fix

There’s another way to write the query that’s extremely similar to the above plan SQL Server decided to use anyways:

SELECT f.Fraternity, t.TownName
FROM Fraternities f
join (
	SELECT DISTINCT t.TownName, s.FraternityId
	FROM dbo.Students s
	JOIN #t t
	ON t.TownId = s.HometownId
	) t
ON t.FraternityId = f.FraternityId

The difference here is that the aggregation is done before the join to Fraternity

What’s Going On?

The high estimate comes from the cartesian product of the temp table (2 rows) and Fraternities (13824 rows) because of the cross join. It just sticks around.

Looking into the used transformations, I identified LSJNtoJNonDist as key. Here’s the plan with OPTION(queryruleoff LSJNtoJNonDist)

This transformation failed to update the cardinality, yet others, like commuting joins, did. Here’s an example where adding a one row table to the query gets it moved earlier and updates cardinality.

SELECT 1 AS Id
INTO #onerow

SELECT f.Fraternity
FROM #t t
CROSS JOIN Fraternities f
JOIN #onerow p
ON f.FraternityId = p.Id
WHERE EXISTS (
	SELECT 1
	FROM dbo.Students s
	WHERE t.TownId = s.HometownId
	AND s.FraternityId = f.FraternityId
	)

And here is approximately where I’m stuck, barring a large further investment of time. I’d like to find another query that relies on LSJNtoJNonDist and see if it fails to update cardinality as well, but my first few attempts ran into different issues (like an early aggregate instead of after the join).

It’s unsatisfying to me – I don’t know whether this is truly a bug or just an expected artifact of the optimization process, but I have a workaround and I’d rather continue yelling at the AI I’m teaching to read query plans.

Setup Script:

USE TestDB
GO

CREATE TABLE dbo.Towns (
TownID INT IDENTITY PRIMARY KEY
,TownName VARCHAR(500) NOT NULL
)

CREATE TABLE dbo.Students (
StudentId INT IDENTITY PRIMARY KEY
,StudentName VARCHAR(100) NOT null
,FraternityId INT NULL
,HometownId INT NOT NULL
)

CREATE INDEX IX_Students_FraternityId
ON dbo.Students (FraternityId)

CREATE INDEX IX_Students_HometownId
ON dbo.Students (HometownId)

CREATE TABLE dbo.Fraternities (
FraternityId INT IDENTITY PRIMARY KEY
,Fraternity NCHAR(3) NOT null
,FraternityName VARCHAR(100) NOT null
)

INSERT dbo.Towns (TownName)
SELECT DISTINCT Location
FROM StackOverflow2010.dbo.Users 
WHERE TRY_CONVERT(VARCHAR(100),Location) IS NOT NULL

;WITH GreekLetters AS (
    SELECT 'Alpha' AS LetterName, N'Α' AS UpperCase, 'α' AS LowerCase, 1 AS SortOrder
    UNION ALL SELECT 'Beta', N'Β', 'β', 2
    UNION ALL SELECT 'Gamma', N'Γ', 'γ', 3
    UNION ALL SELECT 'Delta', N'Δ', 'δ', 4
    UNION ALL SELECT 'Epsilon', N'Ε', 'ε', 5
    UNION ALL SELECT 'Zeta', N'Ζ', 'ζ', 6
    UNION ALL SELECT 'Eta', N'Η', 'η', 7
    UNION ALL SELECT 'Theta', N'Θ', 'θ', 8
    UNION ALL SELECT 'Iota', N'Ι', 'ι', 9
    UNION ALL SELECT 'Kappa', N'Κ', 'κ', 10
    UNION ALL SELECT 'Lambda', N'Λ', 'λ', 11
    UNION ALL SELECT 'Mu', N'Μ', 'μ', 12
    UNION ALL SELECT 'Nu', N'Ν', 'ν', 13
    UNION ALL SELECT 'Xi', N'Ξ', 'ξ', 14
    UNION ALL SELECT 'Omicron', N'Ο', 'ο', 15
    UNION ALL SELECT 'Pi', N'Π', 'π', 16
    UNION ALL SELECT 'Rho', N'Ρ', 'ρ', 17
    UNION ALL SELECT 'Sigma', N'Σ', 'σ', 18
    UNION ALL SELECT 'Tau', N'Τ', 'τ', 19
    UNION ALL SELECT 'Upsilon', N'Υ', 'υ', 20
    UNION ALL SELECT 'Phi', N'Φ', 'φ', 21
    UNION ALL SELECT 'Chi', N'Χ', 'χ', 22
    UNION ALL SELECT 'Psi', N'Ψ', 'ψ', 23
    UNION ALL SELECT 'Omega', N'Ω', 'ω', 24
),

-- Generate all combinations using cross joins
AllCombinations AS (
    SELECT 
        L1.UpperCase + L2.UpperCase + L3.UpperCase AS GreekLetters,
        L1.LetterName + ' ' + L2.LetterName + ' ' + L3.LetterName AS FullName
    FROM GreekLetters L1
    CROSS JOIN GreekLetters L2
    CROSS JOIN GreekLetters L3
)

INSERT dbo.Fraternities(Fraternity,FraternityName)
SELECT GreekLetters,FullName
FROM AllCombinations

INSERT dbo.Students(StudentName,FraternityId,HometownId)
SELECT DisplayName
,FLOOR(RAND(CONVERT(VARBINARY(20),NEWID()))*13824+1)
,FLOOR(RAND(CONVERT(VARBINARY(20),NEWID()))*12265+1)
FROM StackOverflow2010.dbo.Users 
WHERE TRY_CONVERT(VARCHAR(100),DisplayName) IS NOT null

One thought on “Cardinality in a Shared Membership Query”

Leave a Reply to Brent Ozar Cancel reply

Your email address will not be published. Required fields are marked *