We're getting ready to perform a large upgrade on our SQL Servers and are noticing some unusual behavior with Distributed Availability Groups that I'm trying to resolve before moving forward.
Last month, I upgraded a remote secondary server from SQL Server 2016 to SQL Server 2017. This server is a part of multiple Distributed Availability Groups (DAGs) and a separate Availability Group (AG). When we upgraded this server, we were unaware that it would get into an unreadable state, so during the past month we've solely been relying on the primary server.
As a part of the upcoming upgrade, I applied the CU 4 patch to the server and rebooted it. When the server came back online, the just-patched secondary showed all of the DAGs/AGs were syncing without any issues.
However, the primary was showing a very different story. It was reporting that the separate AG was syncing without any issues but the DAGs were in a Not Synchronizing / Not Healthy state.
After initially panicking, I attempted the following to get the DAGs synchronizing again:
ALTER DATABASE [<database name>] SET HADR RESUME; - which executed without errors, but did not resume any syncing
My last attempt at getting the data syncing again was to log in to the secondary and manually restart the SQL Server service. Manually restarting the service seems a bit extreme, as I'd expect the reboot itself to have been enough.
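For completeness, the resume attempt above was run one database at a time. A sketch that generates the RESUME statement for every locally suspended AG database (column names from sys.dm_hadr_database_replica_states; review the output before running it) would be:

```sql
-- Generate a RESUME statement for each AG database on this replica
-- whose data movement is currently suspended.
SELECT 'ALTER DATABASE ' + QUOTENAME(DB_NAME(database_id)) + ' SET HADR RESUME;'
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1
  AND is_suspended = 1;
```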
Has anyone run into this issue where a DAG doesn't start syncing to a secondary after a reboot? If so, how was it resolved?
I checked both the SQL Server error log and the Event Viewer on the secondary server; there was nothing out of the ordinary that I could see.
I'll preface this all with the caveat that I do not have any DAGs in production. Fundamentally though this advice should apply between both AGs and DAGs.
Did the synchronization resume following the service restart? If so then my best guess to the cause would be blocking on the redo SPID. If it's still not synchronizing even after the restart, here's what I'd be checking first:
Blocking of AG redo SPID
Generally only going to occur on a readable secondary. To check, run the following:
select session_id, blocking_session_id, db_name(database_id), wait_type from sys.dm_exec_requests where command = 'DB STARTUP'
If any blocking SPIDs appear, you'll need to kill them before the secondary can resume (the DB STARTUP SPID is what handles the redo operations). I'd suggest reviewing the blocking SPID beforehand to try and determine the cause (usually a long-running report).
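As a sketch of that review-then-kill flow (the session id 75 at the end is a made-up placeholder, not a value from the question):

```sql
-- Identify who is blocking the redo (DB STARTUP) session and what they are.
SELECT r.blocking_session_id,
       s.login_name,
       s.program_name,
       s.last_request_start_time
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = r.blocking_session_id
WHERE r.command = 'DB STARTUP'
  AND r.blocking_session_id <> 0;  -- 0 means no blocker

-- After confirming it's safe to do so (e.g. a long-running report):
-- KILL 75;  -- 75 is a placeholder session id
```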
If you want further information on this, there's a great article (including monitoring for this type of behaviour using XEs) here.
If data movement is suspended, you can refer to DMVs to get more information on the suspend reason. Run the following:
select db_name(database_id), synchronization_state_desc, database_state_desc, suspend_reason_desc from sys.dm_hadr_database_replica_states
The BOL article describes the suspend_reason a little further.
Please note, this is not a definitive answer but it's the best answer after chatting with Taryn.
However, the primary was showing a very different story. It was reporting that the separate AG was syncing without any issues but the DAGs were in a Not Synchronizing / Not Healthy state
If the individual databases and AGs underlying the distributed AG say they are healthy and synchronizing, there is a good chance this is just a hiccup in the DMVs and/or SSMS dashboards, since there was nothing in the error log to suggest the replica didn't connect or was in a disconnected state.
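To check that underlying health yourself, you can look at the DAG from the global primary; a sketch using the catalog and DMV columns involved (is_distributed flags the DAG itself):

```sql
-- On the global primary: the DAG's view of each of its replicas (forwarders).
SELECT ag.name,
       ar.replica_server_name,
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
  ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ars.replica_id = ar.replica_id
WHERE ag.is_distributed = 1;
```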
Unfortunately, since the issue has resolved itself, it's hard to say exactly what it was... but if this occurs for someone in the future:
- Use the extended event sqlserver.hadr_apply_log_block to see if the secondary is actually receiving/applying the log blocks, or ...
- Check the performance counter SQLServer:Database Replica\Log Bytes Received/sec
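If you'd rather not attach Performance Monitor, that counter can also be read from the DMVs. A sketch (note that per-second counters in this DMV are cumulative, so sample twice and compare the difference):

```sql
-- Per-database log bytes received on the secondary. The value is a
-- running total; take two samples a few seconds apart and diff them.
SELECT instance_name AS database_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Replica%'
  AND counter_name = 'Log Bytes Received/sec';
```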
If you're receiving data on that secondary but the distributed AG still shows not synchronizing or not healthy, then I'd let it go for a bit to see if the DMV values change, since it's obviously receiving and processing log blocks.
If, however, it isn't, then we'll need to investigate further, which is out of scope for this answer.
Is your Distributed Availability Group (DAG) split between different regions? If so, you could be suffering from the default SESSION_TIMEOUT value (10 seconds) being too low: the latency between the two regions is too high for syncing to complete reliably within that window.
A normal availability group can have its SESSION_TIMEOUT value increased to make the syncing sessions more stable. I noticed late last year that the SESSION_TIMEOUT parameter of DAGs could not be edited, which meant DAGs were only viable for low-latency scenarios. We logged a ticket with Microsoft and earlier this year a hotfix was released.
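With that fix applied, raising the timeout should follow the usual DAG MODIFY shape; a sketch where [distributedag] and 'ag2' are placeholder names, run from the global primary:

```sql
-- Raise the session timeout for one of the DAG's member AGs
-- ([distributedag] and 'ag2' are placeholders for your own names).
ALTER AVAILABILITY GROUP [distributedag]
MODIFY AVAILABILITY GROUP ON
    'ag2' WITH (SESSION_TIMEOUT = 60);  -- seconds

-- For comparison, the equivalent on a normal AG replica:
-- ALTER AVAILABILITY GROUP [ag1]
-- MODIFY REPLICA ON 'SQLNODE2' WITH (SESSION_TIMEOUT = 60);
```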