SQL Server 2019 DAG WFC - Manual Failover won't work (MSSQL Error 41131)

650 Views Asked by At

We set up a distributed failover cluster with 2 Windows Server 2019 Datacenter nodes, each of them running SQL Server 2019 Enterprise + SSMS18.

  • The two nodes are located in two different sites with two different IP-Subnets.

  • Each Host is a ESXI VM with only one NIC (Host A in Subnet A, Host B in Subnet B).

  • Both sites are connected via a S2S-VPN Connection and routing possibilities for traffic between.

Problem

We double checked every possible problem, but we cannot get managed, to manual failover an AvailabilityGroup with a synchronized DB via SSMS

  • Instance -> Always On High Availability -> Availability Groups -> -> Right-Click "Failover"

    • SQL Server error 41131 (see attachment)

Troubleshooting

  • Connection between hosts is up and the "dashboard" shows, that both hosts are communicating, up and synchronized.

  • Defender Firewall rules are there for the DAG-listeners, the Agent, the Browser service. On a PaloAlto Firewall at site A, traffic can be detected between both SQL hosts, but no traffic is denied.

  • Both hosts run via a separate service user for SQL Server Agent and SQL Server engine, so there should not be any trouble with missing rights for the NT Authority\SYSTEM.

Rights to the AD-Clusterobject are there, to create and update any child objects. Two DNS entries for the listener and one for the cluster object are also there after the creation.

Even the automatic seeding between both hosts is working, only the failover through SMSS18 is failing (inserted rows replicate from host A to host B).

Questions

Are there any ideas, at which point we can troubleshoot?

enter image description hereI attached the Error-Message, but was not able to find any useful information online, since the only connected solution is always to change rights for the NT Account, which we do not use for Agent or Engine.

1

There are 1 best solutions below

0
John Brown On

Not sure if you were able to resolve this but here is the answer for future reference.

From here https://learn.microsoft.com/en-us/troubleshoot/sql/availability-groups/error-41131-create-availability-group

You can refence the following:

The [NT AUTHORITY\SYSTEM] account is used by SQL Server Always On health detection to connect to the SQL Server computer and to monitor health. When you create an availability group and the primary replica in the availability group comes online, health detection is initiated. If the [NT AUTHORITY\SYSTEM] account doesn't exist or have sufficient permissions, health detection can't be initiated, and the availability group can't come online during the creation process. Make sure that these permissions exist on each SQL Server computer that could host the primary replica of the availability group.

Even if the SQL Server instance and the SQL Server agent are running under a different service account, there is a process in the cluster that uses [NT AUTHORITY\SYSTEM] to connect to the SQL Server instance hosting the primary replica and it will run the procedure named "sp_server_diagnostics" which is used for health detection and it is an essential part of the Availability Group.

If you check the cluster log around the time of the failover attempt on the node that was supposed to take the primary role, you will see something like this:

INFO  [RES] SQL Server Availability Group: [hadrag] Connect to SQL Server ...
INFO  [RES] SQL Server Availability Group: [hadrag] The connection was established successfully
INFO  [RES] SQL Server Availability Group: [hadrag] Run 'EXEC sp_server_diagnostics 10' returns following information
ERR   [RES] SQL Server Availability Group: [hadrag] ODBC Error: [42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The user does not have permission to perform this action. (297)
ERR   [RES] SQL Server Availability Group: [hadrag] Failed to run diagnostics command. See previous log for error message
INFO  [RES] SQL Server Availability Group: [hadrag] Disconnect from SQL Server

Basically, the failover failed because the [NT AUTHORITY\SYSTEM] account on the new primary does not exist or does not have the necessary permissions to start the health monitor process.

I hope this helps!