Ok, I get it, scheduling queries can be complicated. See this and this and maybe this and this too but only if you have insomnia. I still thought I kinda understood it. Then I started seeing hundreds of query timeouts on a quiet server, where the queries were waiting on…what? CPU? The server’s at 20%, why would a query time out after a minute, while only getting 5s of CPU, and not being blocked?
It was Resource Governor messing up one of my servers recently in a way I never expected. Apparently RG will try to balance *total* CPU usage for pools, to the point that a single query sharing a scheduler will be completely starved.
Uh, this might be easier to explain with a picture:
And a demo of course:
Let’s set up some Resource Pools with no constraints. (Thanks to this blog for making some easy-to-copy scripts). Again, these pools don’t limit anything – the only thing they do is exist. Behold, pools Apple and Banana:
USE master GO CREATE RESOURCE POOL [Apples] WITH( min_cpu_percent=0, max_cpu_percent=100, min_memory_percent=0, max_memory_percent=100, AFFINITY SCHEDULER = AUTO ) GO CREATE RESOURCE POOL [Bananas] WITH( min_cpu_percent=0, max_cpu_percent=100, min_memory_percent=0, max_memory_percent=100, AFFINITY SCHEDULER = AUTO ) GO CREATE WORKLOAD GROUP [Apple] USING [Apples] GO CREATE WORKLOAD GROUP [Banana] USING [Bananas] GO CREATE FUNCTION dbo.Fruit_Inspector() RETURNS SYSNAME WITH SCHEMABINDING AS BEGIN DECLARE @workload_group sysname; IF (program_name() LIKE 'A%') SET @workload_group = 'Apple'; IF (program_name() LIKE 'B%') SET @workload_group = 'Banana'; RETURN @workload_group; END; GO ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.Fruit_Inspector); GO ALTER RESOURCE GOVERNOR RECONFIGURE GO
Now for a little bit of SSMS trickery: you can use connection properties to set an application name of your choice, assigning your connection to a resource pool.
Check which scheduler you’re on (and your session id, workload group, and app name) with this query:
SELECT er.scheduler_id, wg.name AS wg_name, er.session_id, es.program_name FROM sys.dm_exec_requests er JOIN sys.resource_governor_workload_groups wg ON wg.group_id = er.group_id JOIN sys.dm_exec_sessions es ON er.session_id = es.session_id WHERE er.session_id = @@SPID
Get a session ready for Resource Pool Apple – it doesn’t matter what scheduler it’s on. Place an infinite-loop CPU-sucking query in it, ready to go. I don’t start it until later.
DECLARE @trash bigint WHILE 1=1 BEGIN SELECT @trash = COUNT(*) FROM dbo.Votes end
Now get sessions for Pools Apple and Banana that happen to land on an identical scheduler (I did this by opening a bunch of Apple sessions, then reconnecting the Banana session until I got a match).
Start the other infinite-loop query, then run a high-CPU query on each of colliding sessions – here’s what I used:
SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a UNION ALL SELECT COUNT_BIG(*) FROM dbo.Votes a OPTION(MAXDOP 1,USE HINT('QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140'))
Remember we have two queries running from Pool Apple, and one from Pool Banana, where the Banana query is on the same scheduler as an Apple query. Here’s the Banana runtime – 15s of pure CPU:
But here’s the Apple runtime:
Apple took 30s because it wasn’t allowed *any* CPU time while Banana was running. Because SQLOS was trying to balance the resource pools with the infinite query running, the Apple query was waiting on SOS_SCHEDULER_YIELD for the whole Banana query, which I do think is absolutely bananas.
You’ll even see super-long scheduler yield waits with this pattern.
Yes, seriously, waiting 5 seconds to get time on the CPU.
Please note this doesn’t even need to be a heavy query from Pool A:
In summary, have the doodle again:
In other summary, Resource Governor is scary, I blame it for hundreds of avoidable timeouts, and I hope MS improves their documentation.
P.S. If you followed my demo, that infinite loop is still running – might wanna kill it. Even worse, RG is still around, which you can fix with the below:
ALTER RESOURCE GOVERNOR DISABLE