The Three Physical Joins, Visualized

Less than a month into my first SQL job, I was handed a report that “suddenly stopped working.” I followed the Excel sheet to a data connection to a stored procedure to a SELECT to a view on a view on a view.
The difference between the query that worked and the one that never finished (at least, over the 10 hours I let it run), was a Hash versus a Loop join. It was my first introduction to query plans and to the importance of physical joins.

But what are they? Distinct from logical joins (INNER, OUTER, CROSS, etc.), physical joins are the different algorithms available to actually compare data from different tables. You don’t have to specify which one SQL Server will use, and usually SQL Server gets it right. But there’s a large gap between usually and always, and the fine art of query tuning lives in that gap. Having a good understanding of the joins is essential for mastering tuning (if mastery is even possible), so let’s take a look.

Nested Loops

Overview: The simplest algorithm is Nested Loops. Imagine you have two piles [tables] of values, and you want to find the matches. Take the first value from the first pile, and compare it to each value in the second pile. Then grab the next value from the first pile, and compare to everything in the second. Grab the third value…wash, rinse, repeat. If you write the pseudo-code for the algorithm, it looks something like below.

Hey, it’s one For-Each loop inside of another loop…you might even say, Nested inside of another. Ahem. Anyways, here’s a visualization.

Drawbacks: An important detail of Nested Loops to notice is that values in the bottom (inner) input get read multiple times, which can quickly grow out of control for large tables. This isn’t an issue if an index on that inner table allows seeking to the exact match, but otherwise the amount of work to compare rows increases drastically with size.

Merge Join

Overview: Merge Join allows comparison while reading each input value only once. The catch is that each input has to be sorted (and on the compared values). Roughly, start at the beginning of each input and progress through the values, grabbing the next input value from whichever is lower. I have pseudo-code for Merge, but it didn’t seem helpful, so I’m skipping straight to the picture.

Drawbacks: Remember how I said that Merge reads each input value only once? I lied. There’s a special (and not uncommon) case where that’s not true. If both inputs have repeated values, called Many-to-Many, the algorithm gets really complicated. As in, holy-crap-why-is-it-taking-so-long complicated. But if you can guarantee unique values in at least one input (like my gif above shows), you’re golden. So, Merge has two big gotchas: the many-to-many case, and the requirement for both inputs to be sorted.

Hash Match

Overview: The final algorithm is Hash Match, also known as Hash Join. Like Merge, it only needs a single pass through each input. It’s also the most complicated. Here’s how it works (with minor lies simplifications): there are two phases, called the build phase and probe phase. In the build phase, each value from the top input is hashed. The hash and the value are stored in a bucket. There are multiple buckets, and the hash value determines which bucket the hash and original value will be stored in. Then comes the probe phase. For each value in the second input, that value will be hashed, and the hash compared to hashes in the buckets. If the hash matches, then the values stored are checked to confirm the match. With an inner join, values that match are outputted, and otherwise discarded.

Below is my animated Hash Match, but first a note on my fake-for-learning-purposes-only hash function. I take each input number, spell it out, and take the first letter. For example, 1 becomes “one” becomes “o.” 5 becomes “five” becomes “f.”

Drawbacks: So why isn’t everything a Hash Join? Well, surprise surprise, it has its own weaknesses. The biggest one to consider is that “storing hashes and values in buckets” piece. Storing occurs in memory (hopefully), and memory is a limited resource. And, if the Hash Match needs more memory than it was given, it doesn’t get more, but writes (spills) to disk instead, killing performance.

I would probably summarize an introduction to the joins as below

JoinStrengthsCommonly SeenBlows Up When
Nested LoopsSimple, fast with the right indexes.Small joins with proper indexes.Inner table has lots of rows repeatedly read.
MergeVery fast with sorted inputs.Key to key joins. Query includes ORDER BY.Necessary sorts take too long. Many-to-Many.
HashGenerally fast.Heaps. Large joins.Can’t get enough memory.

The three algorithms each have different strengths, weaknesses, hidden gotchas, and bizarre variants. I’ve fallen in love with the subtleties and complexities that arise from deceptively simple starting points, and it pained me to have to skip over important details (such as Merge and Hash joins requiring an equality comparison). There’s really a lot more to write about, which I intend to do…eventually. In the meantime, if you want to read more, I suggest looking at Craig Freedman’s articles and Hugo Kornelis’ site.

So is NOLOCK Actually Faster?

NOLOCK has a reputation. Several reputations in fact. To the database snob, NOLOCK is a vestigial practice of the untrained masses, a Pavlovian reflex borne [yes, borne] of fossilized ignorance and apathy. Millions of man-hours of coding have gone into building robust ACID databases that guarantee consistency… to be casually thrown away by peons who can’t be bothered to actually fix their design.

For others, NOLOCK is the two-click salvation when the system is down, the company is losing thousands of dollars a minute, and the full weight of the corporate totem pole is pressing down. Incorrect results? Sure, in theory. Never seen one. NOLOCK works, so why not keep using it?

What about another reputation? That NOLOCK is the “go faster” hint? This one can be tested, and I love testing! There are three main ways it might work:
1) because it doesn’t get blocked
2) because it doesn’t have the overhead of taking locks
3) allocation order scans

Using NOLOCK to get around blocking is…I’m trying to think of a word here…dumb. I’m not talking about running an ad-hoc query on a busy production system where exact results don’t matter. I’m talking about adding the hint to EVERYTHING as a matter of practice. If your queries are slow from blocking, you need to fix how much data they’re reading, often through indexes, or move to RCSI. But others have written more on this.

Regarding allocation order scans, they help the most on severely fragmented tables – the ones with a GUID as clustering key for example. They also seem to help with physical reads. In fact, if you search for “NOLOCK performance testing,” the top result shows using the hint is faster, but the reason in that test is the different scan type. But when your data is in memory and not especially fragmented (you know, the way you’d expect when you’re not being bad at databases), the allocation scan doesn’t help as much.

So what about the second scenario? Ignoring blocking, ignoring allocation scans, does the locking overhead make a difference? Let’s find out!

Here’s my setup – four tables wide and narrow, max column and not.

I then run the following SELECT statements that scan the tables. The WHERE clause makes sure no rows get returned and sent over the network. I use a warm cache because that’s a bit more realistic for most environments.

Average scan speeds (in ms) are as follow:

Table Normal NOLOCK
narrow 51 53
wide 794 525
narrow_max 53 52
wide_max 1433 1124

So yes, it looks like NOLOCK is faster, but only on wider tables in a scan situation. Seriously though, if your system is running a lot of scans on really wide tables, you might want to reconsider if speeding them up is your real priority.

What about an OLTP-style test, with a lot of single-row seeks? I used SQL Query Stresser with 40 threads of 10s WHILE loops and the below procs.

And here’s what I used to grab execution stats:

And here are my results, measured in executions, averaged over three runs.

Table Normal NOLOCK
narrow 11,059,073 10,530,684
wide 10,770,979 10,838,683
narrow_max 10,821,620 10,533,960
wide_max 10,443,677 10,035,070

The runs had enough variance to them that I’m not confident in saying a particular method is a clear loser, but it should be obvious that NOLOCK is no help here.

And overall I think NOLOCK is a loser. Sure, it’s faster in scans on wide tables, but maybe you should fix the scans and the wide tables, instead of slapping on a bandaid. You know, you could TABLOCK hint for the same effect, without risking correctness.

There’s more I could test, and probably will. It turns out databases are complicated, or something. NOLOCK performance has long interested me – my only (as of this moment) dba.stackexchange question is about a case where NOLOCK is slower. In the end, though, I admit that I’m a database snob. Stop using NOLOCK, and fix your design!

Gotta Cache ‘Em All: Query Store Survey

I have a friend who’s really into Pokemon. Really into Pokemon. I had no idea there was such a thing as a “shiny” pokemon until he, uh, graciously spent a whole evening lecturing me on their value and capture.

It’s a passion I don’t really get, but have to tolerate, maybe even respect. It turns out I’m the same, just with different monsters. You see, I really like plan operators. I like knowing their behavior, I like learning their natural habitat, and I love hunting for exotics. If I could find a way to stuff and mount Flow Distinct on my wall, I probably would.

There a handful of ways to see what operators roam your production environment. If you’re lucky enough to be working with 2016 or later
(I’ve seen the stats though, condolences for those of you still on 2005), you can peek into Query Store. Actually, with how long the query takes to run, it might be more of an expedition than a peek. Caveat executor.Ā 

Here’s the script – again, mind that this can be an expensive query.

The most common operator for me, at a whopping quarter-million occurrences, is the Clustered Index Seek. That’s probably a good thing, since I do OLTP. I still see plenty of scans, because developers, and a decent range of not-rare-but-not-common operators, like the Table Spool (hmm, I should probably look into that query…).

Then there’s the esoteric bottom of the list, where operators are only seen a couple of places instead of a quarter-million. One of these was new to me: Inserted Scan, which comes from the Inserted pseudo-table in a trigger. I have a couple oddities like Parameter Table Scan and Hash Match for a Full Outer Join.

Of course, hiding down there are three shiny Flow Distincts. Time for me to hunt down a repro!

Cascading Deletes – Cardinality

Who knew cascading deletes could be so horribly entertaining (emphasis on horrible)? The problems I wrote about traced back to cardinality, but I wasn’t ready to write about it. So, uh, I might have spent most of a perfectly good weekend reading Dmitry and Microsoft patents.

The scan plan was coming from a really bad cardinality assumption: given a Parent ID sufficiently outside the stats histogram, *every* Child table row would join to it. But why was I getting such a ridiculous estimate?

Time for an undocumented trace flag (gosh, I love these)! Dmitry’s post introduced me to trace flag 2363, which outputs cardinality estimation info for the new estimator. I tossed in a QUERYTRACEON, and got the following.

Coooooooool! The optimizer considers stats on Parent.ID to be a badly-formed histogram, whatever that is. Moreover, it just gives up, moves to CSelCalcGuessComparisonJoin, and gives selectivity of 1 (aka everything). Nice guess.

What happens when the ID is within range?

Some huge amount of testing later, I’m able to determine that the bad bailout only occurs when 1) Parent stats are “badly-formed” and 2) the ID is out of range. Stats updated with a FULLSCAN aren’t badly-formed, which seems to be why the problem disappears in that circumstance.

Also interesting is what happens in 2014 compatibility with QO Fixes off.

It ejects, but uses CSelCalcSimpleJoin, with a much more reasonable guess.

I kind of have my answer at this point, which that the reason for crappy cardinality is buried deep within the proprietary bowels of SQL Server’s estimator design. Luckily there was a holiday sale on unsatisfying answers, and I got a new question for free.

What the heck is a “badly-formed histogram?”

Spelunking with windbg left me in the dark, so I resorted to empirical testing with stats donating. The best I can tell, a badly-formed histogram only occurs for a column that is unique and has decimal-valued range_rows. In turn, decimal-valued range_rows only seem to occur for large-enough tables with sampled statistics. How’s that for a summary of weeks of investigation?

To be honest, I’m not really happy with the state of my understanding here. My answers are more anecdote than explanation, only looking at a lower layer of turtle. Maybe if I booze up a Microsoft employee enough I can get them to share info outside of NDA…

Anyways, I’m off to read more patents.

Cascading Delete Troubles

I had an interesting conversation with a developer the other day – it went something like this:

Developer: “Why does it take three minutes to delete a single row from this table?”

Me: “Hold on while I look at the plan. Oh, it’s because SQL Server is checking the 460 foreign keys that reference that table, 447 of which don’t have a supporting index.”

I know the pains of foreign keys without indexes. Oh, do I know. So when I saw a cascading single-row delete plan with a scan on the child table, I thought, “Aha! I know how to fix this, an index!”

But the index was already there, a perfect one for supporting that foreign key with the abominable ON DELETE CASCADE . Something was obviously wrong, and I immediately suspected our recent switch to 2016 compatibility level – meaning a new cardinality estimator.

Sure enough, in 2012 mode the query optimizer produced a seek.

I was able to come up with a repro of the issue (which is what the above plans are from).

The short of it is, with any value too far outside of the histogram of the parent table, the new cardinality estimator assumes that EVERY SINGLE ROW of the child table will get deleted.

This can be seen even when a seek is forced through a USE PLAN hint.

Interestingly enough, running the new cardinality estimator in 2014 compatibility level with Query Optimizer Fixes off will also produce a seek plan. I found a few other ways to get the good plan, and stuck them in a bonus script.

There’s something I don’t understand happening with the cardinality estimator here (as is usual). It seems when the value being deleted from the Parent table is within its histogram, the optimizer uses stats from the Child. Otherwise, it assumes the entire Child table matches. Ouch.

Remember how my demo only has two tables? The real scenario I encountered was a cascading delete hitting forty tables, and, you guessed it, scanning each one. Block party!

Forcing plans with the help of Query Store only goes so far, especially when it’s playing whack-a-mole with ORM-generated SQL. The best solution for now might be to retreat to 2012 compatibility and complain on the internet. Bah!

Debugging for Trace Flags

Good thing I have a test machine, because I just crashed SQL Server again. Prepare yourselves for (yet another) post about doing horrible things to the database for fun and profit more fun.

Obligatory “blah blah serious warning blah blah DON’T DO THIS IN PRODUCTION blah blah or you’ll go blind blah blah.” Ok, let’s get started! I have a documented obsession with undocumented trace flags and have long wanted to use the debugger to discover new ones. I just didn’t comprehend what I saw in windbg until I learned about registers and how a program actually runs. Thanks, Crash Course Computing!

Armed with an incomplete understanding of fundamentals and a googled list of windbg commands, I hypothesized that if I turned on some trace flags for a query, they would eventually be passed through one of the registers. From there, I could work out the mechanisms involved and maybe even how to find new flags!

Running a query with a ridiculously long compile time let me capture optimization call stacks with PerfView. The common root of those seemed to be sqllang!CSQLSource::Transform, so I set a breakpoint on it. F11 let me step into the various classes [I don’t even know if I’m using the right terms here] called, and I eventually hit a lucky break.

Behold, sqldk!CSessionTraceFlags::CheckSessionTraceInternal

Clues just don’t get more obvious than that.

Setting a breakpoint on that function [module? class?] revealed that it gets called a lot. I kept F11-ing through the function while watching the registers. I had turned on several trace flags and was looking for their hex forms, but didn’t see them. Eventually, I ran across a call to CheckSessionTraceInternal originating from sqllang!CExecLvlIntCtxt::FDontExpandPersistedCC. “Aha!” I thought (and might have said out loud) “I know what trace flag prevents expanding persisted computed columns.” A glance at my favorite list of trace flags confirmed what I remembered; I was looking at TF 176. The windows calculator has a handy hex converter in its “Progammer” setting, and I quickly determined I would be looking for “b0” in a register.


The pattern soon became obvious.Ā CheckSessionTraceInternal is called whenever a function checks whether a trace flag is turned on. Luckily, whenĀ CheckSessionTraceInternal is called, the value of the trace flag it’s checking is passed through register rdx (I think technically edx, but don’t ask me the difference yet) and then stored in r10.

I spent a dozen-plus hours hitting breakpoints onĀ CheckSessionTraceInternal, and made a long list of interesting flags along with the function [method?] checking the flag. I even saw my new friend 8672.

Well, that was the long rambling version of how I discovered this. TLDR is that a breakpoint onĀ sqldk!CSessionTraceFlags ::CheckSessionTraceInternal exposes the trace flag being checked in a register. And of course, Paul White found it first.

Too Many TempDB Datafiles?

Maybe you’ve seen the Microsoft recommendations on tempdb, that multiple datafiles help reduce contention. They don’t simply say 1 file per CPU though, but to stop at 8 and slowly increase from there. Of course, my curiosity was piqued. Why increase slowly and carefully? Can we run into problems with too many datafiles? How many is too many?

Since there’s no official Microsoft guidance on breaking tempdb (geez, guys…), I looked to the wisdom of an ancient meme, and modified tempdb to have over 9000 datafiles.

That should be enough.

40 minutes later, I’m ready to see if anything breaks.

The obvious starting point is creating a temp table, and sure enough, it’s slower than before, about 30X slower.

Also interesting is that scans are really slow. Realllllllly slow.

Even though there’s obviously a problem here, nothing shows up in the wait stats. (Below is a simple differential script I use).

What’s SQL Server burning CPU on? To figure this out I turn to PerfView (ht Josh).

GetNextScanIAM? The engine appears to be spending most of its time traversing the IAM chain, which means it’s using an Allocation Order Scan. In short, SQL Server has multiple access methods. An Allocation Order Scan is only available under certain conditions, which which my query on a large local temp table apparently qualifies for. The pages are located from the mapping pages (IAMs), instead of going page-to-page along the leaf level. In normal circumstances, I find an allocation order scan to be faster, but here, it chokes, because there are 5k+ IAMs.

I want to test my hypothesis though. I know that allocation order scans don’t occur when there’s an ORDER BY, so what happens when I add that?

Wow. From 22387ms of CPU down to 60.

The fun doesn’t stop here of course (by “fun” I mean the horrible performance problems you might expect from having thousands of datafiles).

It turns out MAX-length variables start having issues after a certain size, since they can be stored in tempdb. Again, the wait stats are silent, but PerfView makes me suspect these variables are written to tempdb as part of a LOB.

The version store (used for optimistic isolation levels) lives in tempdb, so I thought I’d test it out. I crafted a decent workload, confirmed it added significantly to the version store, and then watched it have no problems whatsoever. In fact, it appears to be faster! Oh well…I did learn that the version store usually (for my test) doesn’t cause physical writes to tempdb, but just hangs out in memory until cleaned up.

Spills get written to tempdb, but I didn’t see any performance issues there either, well, other than the fact that there’s a spill happening.

Oh hey, remember how SQL Server rebuilds tempdb every time it restarts? And that I mentioned creating over 9000 files took 40 minutes? You can guess what happens next. Though the SQL Server service restarted within seconds, logins failed for a half hour while tempdb was rebuilding. Fun!

Even a developer would have recognized (probably) that adding thousands of datafiles is a Bad Idea. But this is a Hold My Beer demo, and ended up teaching me a few things about SQL Server. I hope it helped you too!

Powershell to Search for Trace Flags

I think of trace flags like buttons in a Bond car. Find the right one and you have a missile launcher. Press a different one and you’re now frantically trying to stop the self-destruct mechanism. Still, I don’t think I’d be able to to resist pushing buttons, and that’s what I get to do with SQL Server.

There are some really interesting trace flags that expose the internal workings of SQL Server. And there are many more that change the internal workings. Although nobody’s found the self-destruct flag yet, there are some that are pretty close.

So of course, I’m going to try every trace flag.

I’m doing this on a test server of course, one that I don’t mind crashing and burning. My goal is to find flags that change output to the message tab. If you want to see an impressive search method that looks for plan-affecting flags, my SQL senpai Joe Obbish has a post here.

Obviously, testing 11k+ flags one at a time in SSMS is not performant, so I needed a way to grab message output programmatically. Stealing code from here and here, I hacked together a powershell script that I fully expect someone to critique. Even I, a DBA, know that global variables smell funny.

There’s a really cool trace flag that adds info on hash joins to the message output, so maybe there’s something that adds info on spills? I’ll use my spilling query (slightly modified) from a previous post to check.

Alas, though a number of interesting trace flags show up, none of them seem to include spill details. :'(

Some trace flags modify the output of other trace flags. I’m really interested in 8615, which shows the final memo. Let’s see if anything modifies its output (I get the length of the plain 8615 message, then see if any flag changes the length of message output).

Success! I get a number of known trace flags, but also 8672, which I can’t find good info on. That one’s going in the backlog of things to blog about (a backblog?).

I have quite a few variations I want to try with this method, as well as potential improvements. Using DBCC TRACEON instead of QUERYTRACEON seems an obvious one, along with only getting estimated plans (when applicable) to save time.

Still looking for the missile launcher though.

Do Spills Get Logged?

Ever see a SELECT query with WRITELOG waits? I did, and exclaimed, “Oooh, interesting!” to the annoyance of all my cube neighbors. There was no data changing to require logging, so where was it coming from? After a little thought, I hypothesized that a spill to tempdb generates log activity.

But can I prove it?

The first ingredient for this experiment is a spill, and probably the most common operator to spill would be the sort, so let’s build a table to sort:

Ok, now we need a query that will spill.Ā A spill occurs when SQL Server doesn’t provide enough memory to an operator, and it has to use space in tempdb to complete the operation. So we need a way to get SQL Server to under-provision memory. The most common way this occurs in the wild is when the actual number of rows passed to a sort is far higher than what the optimizer estimated. But this is SCIENCE[!] so let’s screw with the memory grant instead.

Now we need to read the log generated. Start with a CHECKPOINT command. Since tempdb is in simple recovery (and there are no other ongoing transactions because this is a test instance…right?), this will clear the logs. Then run the spilling query (I use a trash variable to prevent displaying results) and follow it up with the handy-dandy undocumented fn_dblog().



I see a lot of activity on PFS, GAM, and IAM pages that makes me think this is minimally efficiently logged – only the allocations without the content of the spill. It’s even possible to find one of these sort pages by checking allocated pages mentioned in the description column. You can go digging for one yourself, if it’s the kind of thing to make you exclaim, “Oooh, interesting!”

The Longest Row

It’s hard to ask a question about SQL Server to which the answer isn’t, “It depends.” Unless of course you’re asking if you should use Auto-Shrink (the answer is NO! and an angry glare).Ā  Edge cases are everywhere, and hidden gotchas lurk around the corner even for well-known features.

But what about the maximum size of a row on page? Microsoft and even error messages say that you can’t go past 8060 bytes. Surely we can trust that this is an accurate limit?

Hold my beer.

Start with a basic setup, then try to go past 8060 bytes.

Yup, there’s the error message.

Right-size a column to get exactly 8060 bytes. We can even use DBCC PAGE to confirm its length.

Make a couple changes, add another row…

And now we have 8082 bytes!

Once you stop applauding this incredible feat, you might stop to ask how I managed it. What’s going on is combination of two tricks. The first has to do with enabling Snapshot Isolation. Snapshot is an isolation level that allows users to view data consistent to a point in time without locking the row. SQL Server manages this by storing old versions of the row in tempdb, and keeping a pointer to those versions in a 14-byte versioning tag added to the row.

The last 8 bytes come from the clustered index I added. Notice how there’s a clustered index, but no primary key? It’s how I was able to add two rows with the same ID of “1”. SQL Server has to uniquely identify rows though (it’s how it points back to the source row from non-clustered indexes), so it adds a “uniquefier” taking up a total of 8 bytesĀ to the second row.

Like any good Hold My Beer stunt, this one ends with a big error. Watch what happens if you try to update that second row.

I wouldn’t be surprised if there’s a way to push this limit even higher. Spend some time investigating row length for yourself, and see if you can break my record of 8082 bytes. Then, when someone asks how you maintain your grand DBA physique, just say “competitive rowing.”