My first “Holy crap!” moment with Query Store was when I enabled it in production and the dang thing promptly cleared the plan cache (as designed, apparently). My second was seeing the next day’s CPU utilization nearly doubled due to spinlock contention (since fixed in a CU). Talk about side effects. In light of this, do you think I have Query Store enabled in production? Yes, yes I do, because it’s just so useful.
The Problem
My other big complaint is that often the built-in reports don’t work. You know, the ones double-clickable in SSMS like “Top Resource Consuming Queries.” All I get is Waiting… that runs forever.
I once let it run for a couple hours with no luck. And killing the query somehow cleared the plan cache too! Holy crap #3. Thankfully I have an AG, and my good-enough workaround was to run the reports on a secondary while I pursued other problems.
Then, one day I had to check Query Store in a lower environment, and behold, I had the same slowness issue, but the report actually finished! Pent up annoyance boiled into action, and I pulled the query, ran it, snagged the actual plan, and caught the surprise culprit.
In all the slow variations I found, the pattern was the same. Repeated access to a TVF packaged with sys.query_store dmvs would consume huge amounts of CPU, most commonly occurring alongside a Row Count Spool. It didn’t even matter if the TVF returned zero rows – it was just the number of access times, averaging around 60ms each on my production system. Multiply that across hundreds of thousands of plans/queries being checked, and you have a recipe for slow.
The solution will be to set up plan guides for the SSMS queries that avoid repeated reads of the TVFs. But why stop here? [Actually, feel free to stop here. There’s nothing but gratuitous nerdery until the fix-it scripts] I have PerfView, and I know how to use it!
Fruitless Investigation
Well that’s interesting – pretty much all CPU went into CopyOutAllPlans. But why stop here? I have windbg, and I pretend to know how to use it!
I learned a new trick in windbg, and discovered that AdvanceToNextPlan was getting called hundreds of thousands of times per TVF execution – again, regardless of whether any rows were returned. My current hypothesis is that this TVF has to scan the entire memory allocation of its object, each and every time. Unfortunately, I wasn’t able to figure out which memory object it falls under, as FREESYSTEMCACHE(‘ALL’) did nothing to the in-memory QS data. I’m trying to improve my windbg skills enough to figure out memory information from debugging, but that may be a ways off.
Solution
It was a little tough getting a plan guide to work, due to a requirement that the query match character for character, but my solution is to pull the exact text from the plan as it’s running. So the fix is to:
1) Open up Top Resource Consuming Queries
2) Run the below script while the report is still spinning
3) Cancel the report and run again.
The process can be repeated for the reports you actually look at, just remember to use a new plan guide name!
--USE YourDB
DECLARE @badqry NVARCHAR(MAX)
SELECT TOP (1) @badqry = SUBSTRING(st.text, (r.statement_start_offset/2) + 1,
((CASE statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE r.statement_end_offset
END - r.statement_start_offset)/2) + 1)
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st
WHERE st.text LIKE '(@results_row_count int,@interval_start_time%'
IF @badqry IS NULL
RAISERROR('Missed the plan',16,1)
EXEC sp_create_plan_guide
@name = N'Fixing QueryStore Top Duration',
@stmt = @badqry,
@type = N'SQL',
@module_or_batch = NULL,
@params = N'@results_row_count int,@interval_start_time datetimeoffset(7),@interval_end_time datetimeoffset(7)',
@hints = N'OPTION (HASH JOIN, LOOP JOIN)'
I’m certain there’s a more efficient combination of hints than (HASH JOIN, LOOP JOIN) to prevent the problem, but these have worked well enough for me so far. Please let me know if you find something better.
There aren’t many folks using Query Store, or even 2016 yet, so I guess I’m paying the early-adopter tax (you’re welcome). But for all the frustrating issues I’ve encountered, I still see a genuinely useful product mixed in. And the “Holy crap!” moments keep pushing me to deepen my skills and write mildly entertaining blog posts wherein I pretend to use windbg.