Less than a month into my first SQL job, I was handed a report that “suddenly stopped working.” I followed the Excel sheet to a data connection to a stored procedure to a SELECT to a view on a view on a view.
The difference between the query that worked and the one that never finished (at least, not in the 10 hours I let it run) was a Hash join versus a Loop join. It was my first introduction to query plans and to the importance of physical joins.
But what are they? Distinct from logical joins (INNER, OUTER, CROSS, etc.), physical joins are the different algorithms available to actually compare data from different tables. You don’t have to specify which one SQL Server will use, and usually SQL Server gets it right. But there’s a large gap between usually and always, and the fine art of query tuning lives in that gap. Having a good understanding of the joins is essential for mastering tuning (if mastery is even possible), so let’s take a look.
Overview: The simplest algorithm is Nested Loops. Imagine you have two piles [tables] of values, and you want to find the matches. Take the first value from the first pile, and compare it to each value in the second pile. Then grab the next value from the first pile, and compare it to everything in the second. Grab the third value…wash, rinse, repeat. If you write the pseudo-code for the algorithm, it looks something like this:
```
For Each value in pile1
    For Each value in pile2
        If pile1.value = pile2.value
            Return pile1.value, pile2.value
```
Hey, it’s one For-Each loop inside of another…you might even say the loops are Nested. Ahem. Anyways, here’s a visualization.
Drawbacks: An important detail of Nested Loops to notice is that values in the bottom (inner) input get read multiple times, which can quickly grow out of control for large tables. This isn’t an issue if an index on that inner table allows seeking to the exact match, but otherwise the amount of work grows with the product of the two tables’ sizes.
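The pseudo-code above translates almost directly into runnable Python. Here's a quick sketch, with plain lists standing in for tables:

```python
def nested_loops_join(pile1, pile2):
    """Nested Loops join: compare every outer value to every inner value."""
    matches = []
    for v1 in pile1:           # outer input: read once
        for v2 in pile2:       # inner input: re-read for every outer row
            if v1 == v2:
                matches.append((v1, v2))
    return matches

print(nested_loops_join([1, 3, 5], [2, 3, 5, 7]))  # [(3, 3), (5, 5)]
```

Note how `pile2` gets scanned once per row of `pile1`: that re-reading is exactly the drawback described above, and it's what an index seek on the inner table saves you from.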
Overview: Merge Join allows comparison while reading each input value only once. The catch is that each input has to be sorted on the compared values. Roughly, start at the beginning of each input and progress through the values, grabbing the next input value from whichever side is lower. I have pseudo-code for Merge, but it didn’t seem helpful, so I’m skipping straight to the picture.
Drawbacks: Remember how I said that Merge reads each input value only once? I lied. There’s a special (and not uncommon) case where that’s not true. If both inputs have repeated values, called Many-to-Many, the algorithm gets really complicated. As in, holy-crap-why-is-it-taking-so-long complicated. But if you can guarantee unique values in at least one input (like my gif above shows), you’re golden. So, Merge has two big gotchas: the many-to-many case, and the requirement for both inputs to be sorted.
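For the curious, here's a rough Python sketch of the one-to-many case (the well-behaved one): both inputs are assumed to be pre-sorted, and the first input is assumed to have unique values.

```python
def merge_join(pile1, pile2):
    """One-to-many Merge join: both inputs sorted, pile1 values unique."""
    matches = []
    i = j = 0
    while i < len(pile1) and j < len(pile2):
        if pile1[i] < pile2[j]:
            i += 1                              # advance whichever side is lower
        elif pile1[i] > pile2[j]:
            j += 1
        else:
            matches.append((pile1[i], pile2[j]))
            j += 1                              # pile1 is unique, so keep pile1's
                                                # value and advance only pile2
    return matches

print(merge_join([1, 3, 5, 8], [3, 3, 5, 9]))  # [(3, 3), (3, 3), (5, 5)]
```

Each index only ever moves forward, which is why every value is read just once. In the many-to-many case that guarantee breaks down: the algorithm has to back up and re-read groups of duplicates, which is where the holy-crap-why-is-it-taking-so-long behavior comes from.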
Overview: The final algorithm is Hash Match, also known as Hash Join. Like Merge, it only needs a single pass through each input. It’s also the most complicated. Here’s how it works (with minor ~~lies~~ simplifications): there are two phases, called the build phase and the probe phase. In the build phase, each value from the top input is hashed, and the hash and the value are stored in a bucket. There are multiple buckets, and the hash value determines which bucket a given hash and its original value land in. Then comes the probe phase. Each value from the second input is hashed, and that hash is compared to the hashes in the buckets. If the hash matches, the stored values are checked to confirm the match. With an inner join, values that match are output, and the rest are discarded.
Below is my animated Hash Match, but first a note on my fake-for-learning-purposes-only hash function. I take each input number, spell it out, and take the first letter. For example, 1 becomes “one” becomes “o.” 5 becomes “five” becomes “f.”
Drawbacks: So why isn’t everything a Hash Join? Well, surprise surprise, it has its own weaknesses. The biggest one to consider is that “storing hashes and values in buckets” piece. Storing occurs in memory (hopefully), and memory is a limited resource. And, if the Hash Match needs more memory than it was given, it doesn’t get more, but writes (spills) to disk instead, killing performance.
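To make the build and probe phases concrete, here's a Python sketch of the algorithm using my fake-for-learning-purposes-only hash function from above. (Real hash functions, and SQL Server's bucketing and memory management, are far more sophisticated.)

```python
def fake_hash(n):
    """Toy hash from the article: first letter of the number spelled out."""
    words = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
             6: "six", 7: "seven", 8: "eight", 9: "nine"}
    return words[n][0]

def hash_join(build_input, probe_input):
    """Hash Match join using the toy hash function."""
    # Build phase: hash each value and store (in memory) in its bucket.
    buckets = {}
    for v in build_input:
        buckets.setdefault(fake_hash(v), []).append(v)
    # Probe phase: hash each probe value and check only the matching bucket.
    matches = []
    for v in probe_input:
        for candidate in buckets.get(fake_hash(v), []):
            if candidate == v:   # confirm: hashes can collide (4 and 5 both hash to "f")
                matches.append((candidate, v))
    return matches

print(hash_join([1, 4, 8], [4, 5, 8]))  # [(4, 4), (8, 8)]
```

Probing 5 lands in the same "f" bucket as 4, but the stored-value check rejects it; that's why the confirmation step exists. The `buckets` dictionary is also the memory consumer: in the sketch it's a Python dict, but in SQL Server it's a fixed memory grant, and overflowing it is what causes spills to disk.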
I would summarize an introduction to the three joins as follows:
| Join | Strengths | Commonly Seen | Blows Up When |
|---|---|---|---|
| Nested Loops | Simple, fast with the right indexes. | Small joins with proper indexes. | Inner table has lots of rows repeatedly read. |
| Merge | Very fast with sorted inputs. | Key-to-key joins. Query includes ORDER BY. | Necessary sorts take too long. Many-to-Many. |
| Hash | Generally fast. | Heaps. Large joins. | Can’t get enough memory. |
The three algorithms each have different strengths, weaknesses, hidden gotchas, and bizarre variants. I’ve fallen in love with the subtleties and complexities that arise from deceptively simple starting points, and it pained me to have to skip over important details (such as Merge and Hash joins requiring an equality comparison). There’s really a lot more to write about, which I intend to do…eventually. In the meantime, if you want to read more, I suggest looking at Craig Freedman’s articles and Hugo Kornelis’ site.