There are lots of ways to do dumb stuff with SQL Server. Fun, right? If you’re familiar with ASYNC_NETWORK_IO, then you’re likely aware of the dumbness that causes it.

Otherwise, here’s a quick rundown. Client programs ask for data from the database. If SQL Server has data ready to give, but the client isn’t asking for it yet, worker threads stand around tapping their feet, whistling the Jeopardy tune, and producing ASYNC_NETWORK_IO waits.

There are two common reasons for the client to be slower at receiving data than SQL Server is in providing it. The first is a SELECT * kind of situation, where the client is asking for a large (for certain values of large) amount of data. The other reason is when the client slowly processes data as it’s received, commonly row-by-row.

Sound like an “application issue”? Yeah, sure is. Which means you might need some extra proof for your well informed and highly talented developers. Here, have a PowerShell demo.

Oh, and in case you don’t have your own, here’s a simple script to grab waits in 20s. Make sure to start it before the PS script.

Maybe you want something a little closer to the application code that’s actually causing the problem. Guess what, you can call .NET objects in PowerShell!

You probably noticed that my queries include large extra columns. If you run the script without them…

…the ASYNC_NETWORK_IO waits disappear, even though the row-by-row processing remains! There’s a reason for this: SQL Server doesn’t actually send results row-at-a-time, but stuffs as many as it can into a packet before shipping it off. Without the fluff, all 10 integers arrive at once, and the script no longer leaves the database waiting.

Now for a conversation with those skilled and deeply knowledgeable developers…

Slow Reports for Query Store

My first “Holy crap!” moment with Query Store was when I enabled it in production and the dang thing promptly cleared the plan cache (as designed, apparently). My second was seeing the next day’s CPU utilization nearly doubled due to spinlock contention (since fixed in a CU). Talk about side effects. In light of this, do you think I have Query Store enabled in production? Yes, yes I do, because it’s just so useful.

The Problem

My other big complaint is that often the built-in reports don’t work. You know, the ones double-clickable in SSMS like “Top Resource Consuming Queries.” All I get is Waiting… that runs forever.

I once let it run for a couple hours with no luck. And killing the query somehow cleared the plan cache too! Holy crap #3. Thankfully I have an AG, and my good-enough workaround was to run the reports on a secondary while I pursued other problems.

Then, one day I had to check Query Store in a lower environment, and behold, I had the same slowness issue, but the report actually finished! Pent up annoyance boiled into action, and I pulled the query, ran it, snagged the actual plan, and caught the surprise culprit.

In all the slow variations I found, the pattern was the same. Repeated access to a TVF packaged with sys.query_store dmvs would consume huge amounts of CPU, most commonly occurring alongside a Row Count Spool. It didn’t even matter if the TVF returned zero rows – it was just the number of access times, averaging around 60ms each on my production system. Multiply that across hundreds of thousands of plans/queries being checked, and you have a recipe for slow.

The solution will be to set up plan guides for the SSMS queries that avoid repeated reads of the TVFs. But why stop here? [Actually, feel free to stop here. There’s nothing but gratuitous nerdery until the fix-it scripts] I have PerfView, and I know how to use it!

Fruitless Investigation

Well that’s interesting – pretty much all CPU went into CopyOutAllPlans. But why stop here? I have windbg, and I pretend to know how to use it!

I learned a new trick in windbg, and discovered that AdvanceToNextPlan was getting called hundreds of thousands of times per TVF execution – again, regardless of whether any rows were returned. My current hypothesis is that this TVF has to scan the entire memory allocation of its object, each and every time. Unfortunately, I wasn’t able to figure out which memory object it falls under, as FREESYSTEMCACHE(‘ALL’) did nothing to the in-memory QS data. I’m trying to improve my windbg skills enough to figure out memory information from debugging, but that may be a ways off.


It was a little tough getting a plan guide to work, due to a requirement that the query match character for character, but my solution is to pull the exact text from the plan as it’s running. So the fix is to:
1) Open up Top Resource Consuming Queries
2) Run the below script while the report is still spinning
3) Cancel the report and run again.
The process can be repeated for the reports you actually look at, just remember to use a new plan guide name!

I’m certain there’s a more efficient combination of hints than (HASH JOIN, LOOP JOIN) to prevent the problem, but these have worked well enough for me so far. Please let me know if you find something better.

There aren’t many folks using Query Store, or even 2016 yet, so I guess I’m paying the early-adopter tax (you’re welcome). But for all the frustrating issues I’ve encountered, I still see a genuinely useful product mixed in. And the “Holy crap!” moments keep pushing me to deepen my skills and write mildly entertaining blog posts wherein I pretend to use windbg.

The Three Physical Joins, Visualized

Less than a month into my first SQL job, I was handed a report that “suddenly stopped working.” I followed the Excel sheet to a data connection to a stored procedure to a SELECT to a view on a view on a view.
The difference between the query that worked and the one that never finished (at least, over the 10 hours I let it run), was a Hash versus a Loop join. It was my first introduction to query plans and to the importance of physical joins.

But what are they? Distinct from logical joins (INNER, OUTER, CROSS, etc.), physical joins are the different algorithms available to actually compare data from different tables. You don’t have to specify which one SQL Server will use, and usually SQL Server gets it right. But there’s a large gap between usually and always, and the fine art of query tuning lives in that gap. Having a good understanding of the joins is essential for mastering tuning (if mastery is even possible), so let’s take a look.

Nested Loops

Overview: The simplest algorithm is Nested Loops. Imagine you have two piles [tables] of values, and you want to find the matches. Take the first value from the first pile, and compare it to each value in the second pile. Then grab the next value from the first pile, and compare to everything in the second. Grab the third value…wash, rinse, repeat. If you write the pseudo-code for the algorithm, it looks something like below.

Hey, it’s one For-Each loop inside of another loop…you might even say, Nested inside of another. Ahem. Anyways, here’s a visualization.

Drawbacks: An important detail of Nested Loops to notice is that values in the bottom (inner) input get read multiple times, which can quickly grow out of control for large tables. This isn’t an issue if an index on that inner table allows seeking to the exact match, but otherwise the amount of work to compare rows increases drastically with size.

Merge Join

Overview: Merge Join allows comparison while reading each input value only once. The catch is that each input has to be sorted (and on the compared values). Roughly, start at the beginning of each input and progress through the values, grabbing the next input value from whichever is lower. I have pseudo-code for Merge, but it didn’t seem helpful, so I’m skipping straight to the picture.

Drawbacks: Remember how I said that Merge reads each input value only once? I lied. There’s a special (and not uncommon) case where that’s not true. If both inputs have repeated values, called Many-to-Many, the algorithm gets really complicated. As in, holy-crap-why-is-it-taking-so-long complicated. But if you can guarantee unique values in at least one input (like my gif above shows), you’re golden. So, Merge has two big gotchas: the many-to-many case, and the requirement for both inputs to be sorted.

Hash Match

Overview: The final algorithm is Hash Match, also known as Hash Join. Like Merge, it only needs a single pass through each input. It’s also the most complicated. Here’s how it works (with minor lies simplifications): there are two phases, called the build phase and probe phase. In the build phase, each value from the top input is hashed. The hash and the value are stored in a bucket. There are multiple buckets, and the hash value determines which bucket the hash and original value will be stored in. Then comes the probe phase. For each value in the second input, that value will be hashed, and the hash compared to hashes in the buckets. If the hash matches, then the values stored are checked to confirm the match. With an inner join, values that match are outputted, and otherwise discarded.

Below is my animated Hash Match, but first a note on my fake-for-learning-purposes-only hash function. I take each input number, spell it out, and take the first letter. For example, 1 becomes “one” becomes “o.” 5 becomes “five” becomes “f.”

Drawbacks: So why isn’t everything a Hash Join? Well, surprise surprise, it has its own weaknesses. The biggest one to consider is that “storing hashes and values in buckets” piece. Storing occurs in memory (hopefully), and memory is a limited resource. And, if the Hash Match needs more memory than it was given, it doesn’t get more, but writes (spills) to disk instead, killing performance.

I would probably summarize an introduction to the joins as below

JoinStrengthsCommonly SeenBlows Up When
Nested LoopsSimple, fast with the right indexes.Small joins with proper indexes.Inner table has lots of rows repeatedly read.
MergeVery fast with sorted inputs.Key to key joins. Query includes ORDER BY.Necessary sorts take too long. Many-to-Many.
HashGenerally fast.Heaps. Large joins.Can’t get enough memory.

The three algorithms each have different strengths, weaknesses, hidden gotchas, and bizarre variants. I’ve fallen in love with the subtleties and complexities that arise from deceptively simple starting points, and it pained me to have to skip over important details (such as Merge and Hash joins requiring an equality comparison). There’s really a lot more to write about, which I intend to do…eventually. In the meantime, if you want to read more, I suggest looking at Craig Freedman’s articles and Hugo Kornelis’ site.

So is NOLOCK Actually Faster?

NOLOCK has a reputation. Several reputations in fact. To the database snob, NOLOCK is a vestigial practice of the untrained masses, a Pavlovian reflex borne [yes, borne] of fossilized ignorance and apathy. Millions of man-hours of coding have gone into building robust ACID databases that guarantee consistency… to be casually thrown away by peons who can’t be bothered to actually fix their design.

For others, NOLOCK is the two-click salvation when the system is down, the company is losing thousands of dollars a minute, and the full weight of the corporate totem pole is pressing down. Incorrect results? Sure, in theory. Never seen one. NOLOCK works, so why not keep using it?

What about another reputation? That NOLOCK is the “go faster” hint? This one can be tested, and I love testing! There are three main ways it might work:
1) because it doesn’t get blocked
2) because it doesn’t have the overhead of taking locks
3) allocation order scans

Using NOLOCK to get around blocking is…I’m trying to think of a word here…dumb. I’m not talking about running an ad-hoc query on a busy production system where exact results don’t matter. I’m talking about adding the hint to EVERYTHING as a matter of practice. If your queries are slow from blocking, you need to fix how much data they’re reading, often through indexes, or move to RCSI. But others have written more on this.

Regarding allocation order scans, they help the most on severely fragmented tables – the ones with a GUID as clustering key for example. They also seem to help with physical reads. In fact, if you search for “NOLOCK performance testing,” the top result shows using the hint is faster, but the reason in that test is the different scan type. But when your data is in memory and not especially fragmented (you know, the way you’d expect when you’re not being bad at databases), the allocation scan doesn’t help as much.

So what about the second scenario? Ignoring blocking, ignoring allocation scans, does the locking overhead make a difference? Let’s find out!

Here’s my setup – four tables wide and narrow, max column and not.

I then run the following SELECT statements that scan the tables. The WHERE clause makes sure no rows get returned and sent over the network. I use a warm cache because that’s a bit more realistic for most environments.

Average scan speeds (in ms) are as follow:

Table Normal NOLOCK
narrow 51 53
wide 794 525
narrow_max 53 52
wide_max 1433 1124

So yes, it looks like NOLOCK is faster, but only on wider tables in a scan situation. Seriously though, if your system is running a lot of scans on really wide tables, you might want to reconsider if speeding them up is your real priority.

What about an OLTP-style test, with a lot of single-row seeks? I used SQL Query Stresser with 40 threads of 10s WHILE loops and the below procs.

And here’s what I used to grab execution stats:

And here are my results, measured in executions, averaged over three runs.

Table Normal NOLOCK
narrow 11,059,073 10,530,684
wide 10,770,979 10,838,683
narrow_max 10,821,620 10,533,960
wide_max 10,443,677 10,035,070

The runs had enough variance to them that I’m not confident in saying a particular method is a clear loser, but it should be obvious that NOLOCK is no help here.

And overall I think NOLOCK is a loser. Sure, it’s faster in scans on wide tables, but maybe you should fix the scans and the wide tables, instead of slapping on a bandaid. You know, you could TABLOCK hint for the same effect, without risking correctness.

There’s more I could test, and probably will. It turns out databases are complicated, or something. NOLOCK performance has long interested me – my only (as of this moment) dba.stackexchange question is about a case where NOLOCK is slower. In the end, though, I admit that I’m a database snob. Stop using NOLOCK, and fix your design!

Gotta Cache ‘Em All: Query Store Survey

I have a friend who’s really into Pokemon. Really into Pokemon. I had no idea there was such a thing as a “shiny” pokemon until he, uh, graciously spent a whole evening lecturing me on their value and capture.

It’s a passion I don’t really get, but have to tolerate, maybe even respect. It turns out I’m the same, just with different monsters. You see, I really like plan operators. I like knowing their behavior, I like learning their natural habitat, and I love hunting for exotics. If I could find a way to stuff and mount Flow Distinct on my wall, I probably would.

There a handful of ways to see what operators roam your production environment. If you’re lucky enough to be working with 2016 or later
(I’ve seen the stats though, condolences for those of you still on 2005), you can peek into Query Store. Actually, with how long the query takes to run, it might be more of an expedition than a peek. Caveat executor.Ā 

Here’s the script – again, mind that this can be an expensive query.

The most common operator for me, at a whopping quarter-million occurrences, is the Clustered Index Seek. That’s probably a good thing, since I do OLTP. I still see plenty of scans, because developers, and a decent range of not-rare-but-not-common operators, like the Table Spool (hmm, I should probably look into that query…).

Then there’s the esoteric bottom of the list, where operators are only seen a couple of places instead of a quarter-million. One of these was new to me: Inserted Scan, which comes from the Inserted pseudo-table in a trigger. I have a couple oddities like Parameter Table Scan and Hash Match for a Full Outer Join.

Of course, hiding down there are three shiny Flow Distincts. Time for me to hunt down a repro!

Cascading Deletes – Cardinality

Who knew cascading deletes could be so horribly entertaining (emphasis on horrible)? The problems I wrote about traced back to cardinality, but I wasn’t ready to write about it. So, uh, I might have spent most of a perfectly good weekend reading Dmitry and Microsoft patents.

The scan plan was coming from a really bad cardinality assumption: given a Parent ID sufficiently outside the stats histogram, *every* Child table row would join to it. But why was I getting such a ridiculous estimate?

Time for an undocumented trace flag (gosh, I love these)! Dmitry’s post introduced me to trace flag 2363, which outputs cardinality estimation info for the new estimator. I tossed in a QUERYTRACEON, and got the following.

Coooooooool! The optimizer considers stats on Parent.ID to be a badly-formed histogram, whatever that is. Moreover, it just gives up, moves to CSelCalcGuessComparisonJoin, and gives selectivity of 1 (aka everything). Nice guess.

What happens when the ID is within range?

Some huge amount of testing later, I’m able to determine that the bad bailout only occurs when 1) Parent stats are “badly-formed” and 2) the ID is out of range. Stats updated with a FULLSCAN aren’t badly-formed, which seems to be why the problem disappears in that circumstance.

Also interesting is what happens in 2014 compatibility with QO Fixes off.

It ejects, but uses CSelCalcSimpleJoin, with a much more reasonable guess.

I kind of have my answer at this point, which that the reason for crappy cardinality is buried deep within the proprietary bowels of SQL Server’s estimator design. Luckily there was a holiday sale on unsatisfying answers, and I got a new question for free.

What the heck is a “badly-formed histogram?”

Spelunking with windbg left me in the dark, so I resorted to empirical testing with stats donating. The best I can tell, a badly-formed histogram only occurs for a column that is unique and has decimal-valued range_rows. In turn, decimal-valued range_rows only seem to occur for large-enough tables with sampled statistics. How’s that for a summary of weeks of investigation?

To be honest, I’m not really happy with the state of my understanding here. My answers are more anecdote than explanation, only looking at a lower layer of turtle. Maybe if I booze up a Microsoft employee enough I can get them to share info outside of NDA…

Anyways, I’m off to read more patents.

Cascading Delete Troubles

I had an interesting conversation with a developer the other day – it went something like this:

Developer: “Why does it take three minutes to delete a single row from this table?”

Me: “Hold on while I look at the plan. Oh, it’s because SQL Server is checking the 460 foreign keys that reference that table, 447 of which don’t have a supporting index.”

I know the pains of foreign keys without indexes. Oh, do I know. So when I saw a cascading single-row delete plan with a scan on the child table, I thought, “Aha! I know how to fix this, an index!”

But the index was already there, a perfect one for supporting that foreign key with the abominable ON DELETE CASCADE . Something was obviously wrong, and I immediately suspected our recent switch to 2016 compatibility level – meaning a new cardinality estimator.

Sure enough, in 2012 mode the query optimizer produced a seek.

I was able to come up with a repro of the issue (which is what the above plans are from).

The short of it is, with any value too far outside of the histogram of the parent table, the new cardinality estimator assumes that EVERY SINGLE ROW of the child table will get deleted.

This can be seen even when a seek is forced through a USE PLAN hint.

Interestingly enough, running the new cardinality estimator in 2014 compatibility level with Query Optimizer Fixes off will also produce a seek plan. I found a few other ways to get the good plan, and stuck them in a bonus script.

There’s something I don’t understand happening with the cardinality estimator here (as is usual). It seems when the value being deleted from the Parent table is within its histogram, the optimizer uses stats from the Child. Otherwise, it assumes the entire Child table matches. Ouch.

Remember how my demo only has two tables? The real scenario I encountered was a cascading delete hitting forty tables, and, you guessed it, scanning each one. Block party!

Forcing plans with the help of Query Store only goes so far, especially when it’s playing whack-a-mole with ORM-generated SQL. The best solution for now might be to retreat to 2012 compatibility and complain on the internet. Bah!

Debugging for Trace Flags

Good thing I have a test machine, because I just crashed SQL Server again. Prepare yourselves for (yet another) post about doing horrible things to the database for fun and profit more fun.

Obligatory “blah blah serious warning blah blah DON’T DO THIS IN PRODUCTION blah blah or you’ll go blind blah blah.” Ok, let’s get started! I have a documented obsession with undocumented trace flags and have long wanted to use the debugger to discover new ones. I just didn’t comprehend what I saw in windbg until I learned about registers and how a program actually runs. Thanks, Crash Course Computing!

Armed with an incomplete understanding of fundamentals and a googled list of windbg commands, I hypothesized that if I turned on some trace flags for a query, they would eventually be passed through one of the registers. From there, I could work out the mechanisms involved and maybe even how to find new flags!

Running a query with a ridiculously long compile time let me capture optimization call stacks with PerfView. The common root of those seemed to be sqllang!CSQLSource::Transform, so I set a breakpoint on it. F11 let me step into the various classes [I don’t even know if I’m using the right terms here] called, and I eventually hit a lucky break.

Behold, sqldk!CSessionTraceFlags::CheckSessionTraceInternal

Clues just don’t get more obvious than that.

Setting a breakpoint on that function [module? class?] revealed that it gets called a lot. I kept F11-ing through the function while watching the registers. I had turned on several trace flags and was looking for their hex forms, but didn’t see them. Eventually, I ran across a call to CheckSessionTraceInternal originating from sqllang!CExecLvlIntCtxt::FDontExpandPersistedCC. “Aha!” I thought (and might have said out loud) “I know what trace flag prevents expanding persisted computed columns.” A glance at my favorite list of trace flags confirmed what I remembered; I was looking at TF 176. The windows calculator has a handy hex converter in its “Progammer” setting, and I quickly determined I would be looking for “b0” in a register.


The pattern soon became obvious.Ā CheckSessionTraceInternal is called whenever a function checks whether a trace flag is turned on. Luckily, whenĀ CheckSessionTraceInternal is called, the value of the trace flag it’s checking is passed through register rdx (I think technically edx, but don’t ask me the difference yet) and then stored in r10.

I spent a dozen-plus hours hitting breakpoints onĀ CheckSessionTraceInternal, and made a long list of interesting flags along with the function [method?] checking the flag. I even saw my new friend 8672.

Well, that was the long rambling version of how I discovered this. TLDR is that a breakpoint onĀ sqldk!CSessionTraceFlags ::CheckSessionTraceInternal exposes the trace flag being checked in a register. And of course, Paul White found it first.

Too Many TempDB Datafiles?

Maybe you’ve seen the Microsoft recommendations on tempdb, that multiple datafiles help reduce contention. They don’t simply say 1 file per CPU though, but to stop at 8 and slowly increase from there. Of course, my curiosity was piqued. Why increase slowly and carefully? Can we run into problems with too many datafiles? How many is too many?

Since there’s no official Microsoft guidance on breaking tempdb (geez, guys…), I looked to the wisdom of an ancient meme, and modified tempdb to have over 9000 datafiles.

That should be enough.

40 minutes later, I’m ready to see if anything breaks.

The obvious starting point is creating a temp table, and sure enough, it’s slower than before, about 30X slower.

Also interesting is that scans are really slow. Realllllllly slow.

Even though there’s obviously a problem here, nothing shows up in the wait stats. (Below is a simple differential script I use).

What’s SQL Server burning CPU on? To figure this out I turn to PerfView (ht Josh).

GetNextScanIAM? The engine appears to be spending most of its time traversing the IAM chain, which means it’s using an Allocation Order Scan. In short, SQL Server has multiple access methods. An Allocation Order Scan is only available under certain conditions, which which my query on a large local temp table apparently qualifies for. The pages are located from the mapping pages (IAMs), instead of going page-to-page along the leaf level. In normal circumstances, I find an allocation order scan to be faster, but here, it chokes, because there are 5k+ IAMs.

I want to test my hypothesis though. I know that allocation order scans don’t occur when there’s an ORDER BY, so what happens when I add that?

Wow. From 22387ms of CPU down to 60.

The fun doesn’t stop here of course (by “fun” I mean the horrible performance problems you might expect from having thousands of datafiles).

It turns out MAX-length variables start having issues after a certain size, since they can be stored in tempdb. Again, the wait stats are silent, but PerfView makes me suspect these variables are written to tempdb as part of a LOB.

The version store (used for optimistic isolation levels) lives in tempdb, so I thought I’d test it out. I crafted a decent workload, confirmed it added significantly to the version store, and then watched it have no problems whatsoever. In fact, it appears to be faster! Oh well…I did learn that the version store usually (for my test) doesn’t cause physical writes to tempdb, but just hangs out in memory until cleaned up.

Spills get written to tempdb, but I didn’t see any performance issues there either, well, other than the fact that there’s a spill happening.

Oh hey, remember how SQL Server rebuilds tempdb every time it restarts? And that I mentioned creating over 9000 files took 40 minutes? You can guess what happens next. Though the SQL Server service restarted within seconds, logins failed for a half hour while tempdb was rebuilding. Fun!

Even a developer would have recognized (probably) that adding thousands of datafiles is a Bad Idea. But this is a Hold My Beer demo, and ended up teaching me a few things about SQL Server. I hope it helped you too!

Powershell to Search for Trace Flags

I think of trace flags like buttons in a Bond car. Find the right one and you have a missile launcher. Press a different one and you’re now frantically trying to stop the self-destruct mechanism. Still, I don’t think I’d be able to to resist pushing buttons, and that’s what I get to do with SQL Server.

There are some really interesting trace flags that expose the internal workings of SQL Server. And there are many more that change the internal workings. Although nobody’s found the self-destruct flag yet, there are some that are pretty close.

So of course, I’m going to try every trace flag.

I’m doing this on a test server of course, one that I don’t mind crashing and burning. My goal is to find flags that change output to the message tab. If you want to see an impressive search method that looks for plan-affecting flags, my SQL senpai Joe Obbish has a post here.

Obviously, testing 11k+ flags one at a time in SSMS is not performant, so I needed a way to grab message output programmatically. Stealing code from here and here, I hacked together a powershell script that I fully expect someone to critique. Even I, a DBA, know that global variables smell funny.

There’s a really cool trace flag that adds info on hash joins to the message output, so maybe there’s something that adds info on spills? I’ll use my spilling query (slightly modified) from a previous post to check.

Alas, though a number of interesting trace flags show up, none of them seem to include spill details. :'(

Some trace flags modify the output of other trace flags. I’m really interested in 8615, which shows the final memo. Let’s see if anything modifies its output (I get the length of the plain 8615 message, then see if any flag changes the length of message output).

Success! I get a number of known trace flags, but also 8672, which I can’t find good info on. That one’s going in the backlog of things to blog about (a backblog?).

I have quite a few variations I want to try with this method, as well as potential improvements. Using DBCC TRACEON instead of QUERYTRACEON seems an obvious one, along with only getting estimated plans (when applicable) to save time.

Still looking for the missile launcher though.