I’ve read it repeatedly, columnstore key lookups are extra slow. The question of course, is why?
In my mental model, it makes sense. A normal key lookup adds about 3 reads.
While a columnstore lookup should add at least a read per column, since each column lives in its own segments.
Logically the next step is to test it. I made a copy of the StackOverflow2013 Posts table and slapped clustered columnstore and a nonclustered index on it.
SELECT TOP (1) * FROM dbo.PostsCS WITH(FORCESEEK) WHERE ID > 0
There are 20 columns, so I should expect at least 20 extra reads on top of the inital seek, right?
That’s…a lot more than 20 extra reads. What’s going on?
Segment Lookups, not Row Lookups
First off, each key lookup per column doesn’t just read the page it’s on, but seems to read the entire segment. This can be tested by comparing lob reads to segment size – here are some sample scripts:
SELECT c.name, s.on_disk_size/1024.0/8 AS approx_pages FROM sys.column_store_segments s JOIN sys.partitions p ON p.partition_id = s.partition_id JOIN sys.columns c ON c.object_id = p.object_id AND c.column_id = s.column_id WHERE p.object_id = OBJECT_ID('dbo.PostsCS') AND s.segment_id = 11 --I used segment min/max values to confirm this is the correct one
SET STATISTICS IO ON SELECT TOP (1) LastEditDate --lob logical reads 49, segment size 48 FROM dbo.PostsCS WITH(FORCESEEK) WHERE ID > 0
So it’s more like this:
Dictionaries
It gets worse.
The following query has 7135 lob reads
SELECT TOP (1) Body FROM dbo.PostsCS WITH(FORCESEEK) WHERE ID > 0
Which should make some sense, given that body
is nvarchar(max), but the segment is only 43 pages. 43 pages is rather small for lots of nvarchar(max) data isn’t it? That’s because SQL Server is actually storing the data in dictionaries.
So when the key lookup occurs for a column with a dictionary, not only does the whole segment get read, but the dictionary gets read too. And there might be more than one dictionary.
This is a lot of reads.
And a Mystery
Something seems to happen with dictionary-backed columns where there are more reads than necessary, and a lot of read-ahead reads. I haven’t figured out why, and plan to tackle this next.
Recommendations
Columnstore key lookups can have the disastrous performance that people mistakenly attribute to normal key lookups. When you do need to use them, try to 1) pull as few columns as possible, 2) as few rows as possible, and 3) avoid dictionary-backed columns when possible.