Are Bigint Splits a GUID Idea?

I like fast databases. Naturally, this puts me at odds with GUIDs. Alas, developers like to use them, sometimes for good reasons (“we need ids generated on different servers to be unique”) and often for bad (“what if we run out of bigints”).

The Columnstore Problem

You may have run into issues with GUIDs as clustering keys, but another major problem is in columnstore. Smart people at Microsoft wrote columnstore in a way to take advantage of modern CPU features, but those CPU features don’t play well with datatypes larger than 8 bytes. Which includes GUIDs.

Let me show you: here’s a select on a GUID

And here’s a select on a bigint

We Need to Split Up

Crazy idea time: a uniqueidentifier is 16 bytes…what if we just turned that into two 8 byte bigint columns?

create table #t (
UID uniqueidentifier not null
)

create table dbo.GUID_IDEA (
UID uniqueidentifier not null,
b1 bigint,
b2 bigint,
index CCX_GUID clustered columnstore
)

insert #t
select top (5000000) newid()
from sys.all_columns a, sys.all_columns b, sys.all_columns c

insert dbo.GUID_IDEA (UID, b1, b2)
select UID
,convert(bigint,RIGHT_SHIFT(convert(binary(16),UID),64)) --there are several ways to split - this one is 2022+
,convert(bigint,RIGHT_SHIFT(LEFT_SHIFT(convert(binary(16),UID),64),64))
from #t
The Demo

Split done. We have a 5m row table storing both the full GUID and its decomposed bigints. Time to test it. Load a temp table with some sample values.

create table #driver(
UID uniqueidentifier,
b1 bigint,
b2 bigint
)
go

insert #driver
select *
from dbo.GUID_IDEA tablesample(5 rows)
go 5 --this lets us grab rows from different locations

Now, how does a GUID-only join perform?

select *
from #driver d
join dbo.GUID_IDEA gi
on gi.UID = d.UID

And what if we include the bigints on the join?

select *
from #driver d
join dbo.GUID_IDEA gi
on gi.UID = d.UID
and gi.b1 = d.b1
and gi.b2 = d.b2

According to the plan details, that’s 268ms CPU down to 21ms, huge win! A 16 byte predicate can’t be pushed down here, but two separate 8 byte predicates (or rather, their bitmap filters) can.

This also works for where clauses.

Side Notes:

See this? Should be fast, right?

select *
from dbo.GUID_IDEA gi
where b1 = 3511787043328731458

Except…

Whereas this one is fast:

select *
from dbo.GUID_IDEA gi
where b1 = 42

This happens because the parser is just looking at the number of digits, and if there are as many digits as max bigint (and thus at risk of being larger than a bigint), it stops treating the number like a bigint.

Thankfully the fix is easy:

select *
from dbo.GUID_IDEA gi
where b1 = convert(bigint,3511787043328731458)

Another question you might have is why bigints? Why not just use two binary(8) columns.

Alas…

create table dbo.BETTER_IDEA (
UID uniqueidentifier not null
,p1 binary(8) not null
,p2 binary(8) not null
,index ccx_better clustered columnstore 
)

...blah blah load it and split it, you can figure it out

select *
from BETTER_IDEA
where p1 = 0x97E6A3BAC4E4A64F

This is one I don’t have an explanation for, and may need to send some wine to New Zealand for an answer. Edit: the Wizard from the Other Down Under provided an answer: Columnstore never stores binary data in a pushdown friendly format. Actual explanation here.

Conclusion

I have never done this in production, and I’m not sure I want to even with a 20X speedup – simple designs and getting data right the first time tend to outperform fancy tricks. Maybe, just maybe, it’s worth bolting on to a calcified schema where it’s the only way. If you do try this yourself, let me know how it goes!

Cardinality in a Shared Membership Query

Take a look at this. Pay attention to the estimated rows. See the problem?

Even at a full cartesian join, 49 by 49 rows does not produce 27670.

Usually, when the optimizer applies transformations it updates cardinality estimates, but I ran into a particular construction and transformation where it doesn’t (and I don’t know why).

Here’s the basic idea for my demo query. You have three tables, Students, Towns, and Fraternities. Each student has a hometown and a fraternity. For a given list of hometowns, what fraternities have members from those towns?

One possible way to write the query is like this:

--using a temp table to take the place of a TVP 
--as encountered in a real production scenario
SELECT TOP(2) TownId, TownName
INTO #t
FROM dbo.Towns
WHERE TownName LIKE '%Dallas%'

SELECT f.Fraternity, t.TownName
FROM #t t
CROSS JOIN Fraternities f
WHERE EXISTS (
	SELECT 1
	FROM dbo.Students s
	WHERE t.TownId = s.HometownId
	AND s.FraternityId = f.FraternityId
	)

If the cross join weirds you out, note that it was disguised when I first found it with a one-sided ON filter. E.g.

SELECT f.Fraternity, t.TownName
FROM #t t
JOIN Fraternities f --this is still a cross join
ON f.RandomColumn = @Param

This is the full plan produced, across two pictures. SQL Server doesn’t actually do a cross join, thankully, but starts with the small temp table to find existing Students with matching hometown.

It then uses the FraternityIds from these students to look up data in the Fraternity table, and finally aggregates the data to remove duplicates (to preserve the semi join requirements).

So What?

As encountered in my real production issue, the bad cardinality estimate caused the optimzer to shove inappropriately large tables up before the join, causing timeouts. Cartesian cardinalities can get pretty high, so what should have been a good, filtering join was getting delayed. And no, I didn’t bother with a demo for this part, sorry.

A Fix

There’s another way to write the query that’s extremely similar to the above plan SQL Server decided to use anyways:

SELECT f.Fraternity, t.TownName
FROM Fraternities f
join (
	SELECT DISTINCT t.TownName, s.FraternityId
	FROM dbo.Students s
	JOIN #t t
	ON t.TownId = s.HometownId
	) t
ON t.FraternityId = f.FraternityId

The difference here is that the aggregation is done before the join to Fraternity

What’s Going On?

The high estimate comes from the cartesian product of the temp table (2 rows) and Fraternities (13824 rows) because of the cross join. It just sticks around.

Looking into the used transformations, I identified LSJNtoJNonDist as key. Here’s the plan with OPTION(queryruleoff LSJNtoJNonDist)

This transformation failed to update the cardinality, yet others, like commuting joins, did. Here’s an example where adding a one row table to the query gets it moved earlier and updates cardinality.

SELECT 1 AS Id
INTO #onerow

SELECT f.Fraternity
FROM #t t
CROSS JOIN Fraternities f
JOIN #onerow p
ON f.FraternityId = p.Id
WHERE EXISTS (
	SELECT 1
	FROM dbo.Students s
	WHERE t.TownId = s.HometownId
	AND s.FraternityId = f.FraternityId
	)

And here is approximately where I’m stuck, barring a large further investment of time. I’d like to find another query that relies on LSJNtoJNonDist and see if it fails to update cardinality as well, but my first few attempts ran into different issues (like an early aggregate instead of after the join).

It’s unsatisfying to me – I don’t know whether this is truly a bug or just an expected artifact of the optimization process, but I have a workaround and I’d rather continue yelling at the AI I’m teaching to read query plans.

Setup Script:

USE TestDB
GO

CREATE TABLE dbo.Towns (
TownID INT IDENTITY PRIMARY KEY
,TownName VARCHAR(500) NOT NULL
)

CREATE TABLE dbo.Students (
StudentId INT IDENTITY PRIMARY KEY
,StudentName VARCHAR(100) NOT null
,FraternityId INT NULL
,HometownId INT NOT NULL
)

CREATE INDEX IX_Students_FraternityId
ON dbo.Students (FraternityId)

CREATE INDEX IX_Students_HometownId
ON dbo.Students (HometownId)

CREATE TABLE dbo.Fraternities (
FraternityId INT IDENTITY PRIMARY KEY
,Fraternity NCHAR(3) NOT null
,FraternityName VARCHAR(100) NOT null
)

INSERT dbo.Towns (TownName)
SELECT DISTINCT Location
FROM StackOverflow2010.dbo.Users 
WHERE TRY_CONVERT(VARCHAR(100),Location) IS NOT NULL

;WITH GreekLetters AS (
    SELECT 'Alpha' AS LetterName, N'Α' AS UpperCase, 'α' AS LowerCase, 1 AS SortOrder
    UNION ALL SELECT 'Beta', N'Β', 'β', 2
    UNION ALL SELECT 'Gamma', N'Γ', 'γ', 3
    UNION ALL SELECT 'Delta', N'Δ', 'δ', 4
    UNION ALL SELECT 'Epsilon', N'Ε', 'ε', 5
    UNION ALL SELECT 'Zeta', N'Ζ', 'ζ', 6
    UNION ALL SELECT 'Eta', N'Η', 'η', 7
    UNION ALL SELECT 'Theta', N'Θ', 'θ', 8
    UNION ALL SELECT 'Iota', N'Ι', 'ι', 9
    UNION ALL SELECT 'Kappa', N'Κ', 'κ', 10
    UNION ALL SELECT 'Lambda', N'Λ', 'λ', 11
    UNION ALL SELECT 'Mu', N'Μ', 'μ', 12
    UNION ALL SELECT 'Nu', N'Ν', 'ν', 13
    UNION ALL SELECT 'Xi', N'Ξ', 'ξ', 14
    UNION ALL SELECT 'Omicron', N'Ο', 'ο', 15
    UNION ALL SELECT 'Pi', N'Π', 'π', 16
    UNION ALL SELECT 'Rho', N'Ρ', 'ρ', 17
    UNION ALL SELECT 'Sigma', N'Σ', 'σ', 18
    UNION ALL SELECT 'Tau', N'Τ', 'τ', 19
    UNION ALL SELECT 'Upsilon', N'Υ', 'υ', 20
    UNION ALL SELECT 'Phi', N'Φ', 'φ', 21
    UNION ALL SELECT 'Chi', N'Χ', 'χ', 22
    UNION ALL SELECT 'Psi', N'Ψ', 'ψ', 23
    UNION ALL SELECT 'Omega', N'Ω', 'ω', 24
),

-- Generate all combinations using cross joins
AllCombinations AS (
    SELECT 
        L1.UpperCase + L2.UpperCase + L3.UpperCase AS GreekLetters,
        L1.LetterName + ' ' + L2.LetterName + ' ' + L3.LetterName AS FullName
    FROM GreekLetters L1
    CROSS JOIN GreekLetters L2
    CROSS JOIN GreekLetters L3
)

INSERT dbo.Fraternities(Fraternity,FraternityName)
SELECT GreekLetters,FullName
FROM AllCombinations

INSERT dbo.Students(StudentName,FraternityId,HometownId)
SELECT DisplayName
,FLOOR(RAND(CONVERT(VARBINARY(20),NEWID()))*13824+1)
,FLOOR(RAND(CONVERT(VARBINARY(20),NEWID()))*12265+1)
FROM StackOverflow2010.dbo.Users 
WHERE TRY_CONVERT(VARCHAR(100),DisplayName) IS NOT null