Zombie Transactions Ate my Database

A failover that needs five minutes for the secondary database to come online is a problem, especially if it generates stack dumps. Thankfully, Microsoft has Top Men, and with their help digging into source code, we uncovered…well, something neither of us had ever seen before. I’m naming them Zombie Transactions, because then I can pretend my job is more glamorous than clicking at a screen all day.

Why the Long Failover?

Every time there’s a failover, SQL Server must run the recovery process. The standard Analysis, Redo, and Undo occur, and even sometimes an extra step. In our case, it was Redo taking a ridiculously long time. According to XE, SQL Server was searching through the log for 25,000 active transactions every recovery (and incidentally producing stack dumps because this took so long).

Except these transactions weren’t really live, they were dead, shambling about our system, eating our performance. They didn’t come from normal activity, they didn’t show up in dmvs, and they couldn’t be killed with a failover.

I eventually found where they were lurking by using fn_dblog – they were in our checkpoints. Every checkpoint needs to include active transactions – it’s part of the design that allows a checkpoint to occur without freezing your whole database. And for some reason, our database was including this horde of undead transactions in each checkpoint.

Unresolved Non-DTC Cross Database…What?

This is where Microsoft support was helpful (you know, Top support, not the level 1s you have to put up with for the first month of an issue). The checkpoint entries were identified as “unresolved non-DTC cross-database transactions.” This awkward tongue-twister is the other reason I’m calling them zombies.

In other words, these are leftovers from transactions spanning multiple databases with an AG (for example, a trigger that logs changes to a separate non-AG database). If the transaction doesn’t complete correctly, it can leave an unresolved zombie behind. A minimal repro succeeded at generating a couple with a mid-transaction failover, though we never figured out how exactly 25k came about.

Zombie Hunting

I got tired of checking manually, so I wrote a script to search for zombies. It uses fn_dblog, which is undocumented and comes with caveats. I recommend running on a secondary.

USE [MyDB]
GO
CHECKPOINT
go
DECLARE @base bigint
declare @desc table (descr varchar(500))

select top(1) @base = CONVERT(bigint,CONVERT(varbinary(6),'0x'+REPLACE([Transaction ID],':',''),1))
from sys.fn_dblog(null,null)
where [Transaction ID] <> '0000:00000000'
insert @desc
select Description
from sys.fn_dblog(null,null)
where Operation = 'LOP_XACT_CKPT'

select COUNT(case when xacts.xval < @base then 1 end) as zombies, COUNT(*) as total
from (
    select CONVERT(bigint,CONVERT(varbinary(6),'0x'+x.value,1)) xval
    from (
        select REPLACE(REPLACE(descr,'XdesIDs: ',''),':','') xdes
        from @desc
    ) l
    cross apply STRING_SPLIT(l.xdes,',') x
) xacts
where xval <> 0

How to Fix [Everything with Computers]

There was some back and forth with Microsoft on how to remove these – I didn’t want to edit memory, and restarting the server didn’t help. However, I quickly discovered that setting the database offline then online fixed it. That’s right, turning it off then on again was a solution.

After some more experimenting, we figured out that setting the database to Read-Only then back would fix it, and was faster than offline/online. A teammate also discovered that reseeding is necessary, since the zombies would reappear if you fail over. He built an awesome script for doing this, and I’m sure he would be willing to help for entirely reasonable rates™. Our failover time went from five minutes to ten seconds after we cleared the zombies.

Summary

You might have zombie transactions if you’ve been running an AG without the DTC setting. If you do, and there are a lot, they can slow down failovers. To kill them off, you’ll have to set the database to read-only mode, or bring offline-online. Yes, this requires downtime.

Please let me know if you find them in your system – seriously, let me know, and let Microsoft know. That way there’s a chance of us getting a way to fix without needing downtime.