Sometimes SQL Server has weird behavior that only makes sense when you learn what’s going on (though sometimes it just gets weirder). A recent post by an attentive blogger pointed out a query with GROUP BY returning ordered results, even without an ORDER BY. What’s going on here? The answer has to do with the aggregate operators in SQL Server.
Aggregates are probably one of the first topics you learned in your journey through SQL. They’re used everywhere – not just in relational databases. Open up just about any Excel sheet and you’ll find SUM(), MAX(), COUNT(), or the like. Under the hood, SQL Server uses two primary operators to actually calculate these, the Stream Aggregate and Hash Match Aggregate.
These two algorithms have strong similarities to the Merge Join and Hash Join algorithms (described here, with visualizations, which I recommend reading before continuing).
Let’s begin with an animated Stream Aggregate (using COUNT):
This is very similar to the Merge Join. The key detail is that it requires ordered input. Each successive value within a group gets aggregated, and as soon as the next sequential value shows up, the aggregation starts over.
Hash Match Aggregation works a little differently, as animated below:
Instead of requiring an ordered input, the Hash Aggregate builds a hash table, and stores aggregation details in memory. Of course, that assumes you have enough memory. Say hello to tempdb spilling if you run out. Hash Aggregate also happens to be a blocking operator – it can’t return any rows until it’s consumed all its input.
We end up with the same trade-offs we did with Merge and Hash joins. Stream Aggregate doesn’t block and doesn’t need memory, but needs an ordered input. Hash Aggregate can handle inputs in any order, but needs memory and blocks output until all rows are processed.
You can see both of these aggregates with a single table, with below sample script. Load a table with 10000 rows of identical values, clustered on the first column.
CREATE TABLE #t (ID1 int, ID2 int)
CREATE CLUSTERED INDEX CX_#t ON #t(ID1)
INSERT #t
SELECT TOP (10000)
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0)),
ROW_NUMBER() OVER(ORDER BY (SELECT 1/0))
FROM master..spt_values a
CROSS JOIN master..spt_values b
Run a GROUP BY query on ID1, the clustering column, and you get a Stream Aggregate.
A GROUP BY on ID2, which SQL Server can’t guarantee is ordered, will produce a Hash Aggregate plan.
The results are no longer ordered for the Hash Aggregate plan either.
You can hopefully guess the answer to the ordering puzzle by now. The blogger was looking at a query where an existing index made a Stream Aggregate possible, and got the unexpectedly ordered results because of it. With different queries however, it’s easy to end up with unordered results as SQL Server chooses the Hash Match algorithm instead. No longer weird, but still pretty cool.
Great job Forrest… I heard you have a new place of employment?