We're running SQL Server 2017 Enterprise Edition on a pair of EC2 instances in AWS (we'll call them "P1" and "P2" here). Each EC2 instance runs two instances of SQL Server (one for OLTP and one for OLAP). For maximum HA at manageable cost, we run a "criss-cross" AG architecture: OLTP is "live" on one EC2 instance and OLAP is "live" on the other, with each SQL instance mirroring its data via Always On to the opposite EC2 node.

We have been running in this architecture for approximately two years and it has served us well. About two months ago, however, we lost one of the EC2 instances ("P2" went into a boot loop with a blue screen of death), so we shifted to running both SQL instances on the surviving node ("P1") while we built a replacement server ("P3"), added it to the Windows cluster, and then joined it to the AGs. The Availability Group mirroring (P1 -> P3) appears to be running well, and the AG dashboards are GREEN.

HOWEVER, despite the green AG dashboards, when we try to manually fail over either the OLTP or OLAP AG from P1 to P3 (the new machine), the cluster fails to bring the resources online on the second node (and these attempts have often taken us DOWN entirely, with the Windows cluster requiring manual intervention to bring things back up).

The System Event log and the Cluster log both report the listener IP address as a "duplicate IP address," but our Systems team has searched far and wide and confirms that the IP in question is NOT duplicated anywhere on the network, leading us to believe something else must be going on. I am hoping someone here might have ideas.

I have obfuscated the IP addresses ("xx.xx.xx.xxx") in the log entries below; they reference the AG listener(s).

SYSTEM EVENT LOG errors:

Cluster IP address resource 'SQLOLAP-AG_xx.xx.xx.xxx' cannot be brought online because a duplicate IP address 'xx.xx.xx.xxx' was detected on the network. Please ensure all IP addresses are unique.

The Cluster service failed to bring clustered role 'SQLOLAP-AG' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.


Windows Cluster Log errors:

00002da4.0000433c::2023/07/17-23:08:58.329 ERR [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'RcmRpcAddPossibleOwner( _rcm, pGroup->GetOwnerNode()->NodeNumber, pRes->ObjectId, nodeId )'
00003108.00006b5c::2023/07/17-23:08:59.175 ERR [RES] IP Address <SQLOLAP-AG_xx.xx.xx.xxx>: Open: Unable to find netinterface for node 2 on network b60a5465-f4f5-475d-b265-28377315b5b2, status 5035.
00003108.00006b5c::2023/07/17-23:08:59.175 WARN [RHS] Resource SQLOLAP-AG_xx.xx.xx.xxx has indicated that it cannot come online on this node.
00002da4.00004c48::2023/07/17-23:08:59.175 WARN [RCM] HandleMonitorReply: ONLINERESOURCE for 'SQLOLAP-AG_xx.xx.xx.xxx', gen(2) result 5015/0.
00003108.00006888::2023/07/17-23:08:59.184 ERR [RES] IP Address <SQLOLAP-AG_xx.xx.xx.xxx>: Unable to create and set new IP interface, status 5010
00003108.00006888::2023/07/17-23:08:59.184 ERR [RES] IP Address <SQLOLAP-AG_xx.xx.xx.xxx>: Failed to create IP address xx.xx.xx.xxx, status 5010.
00003108.00006888::2023/07/17-23:08:59.184 WARN [RHS] Online for resource SQLOLAP-AG_xx.xx.xx.xxx failed.
00002da4.00003688::2023/07/17-23:08:59.184 WARN [RCM] HandleMonitorReply: ONLINERESOURCE for 'SQLOLAP-AG_xx.xx.xx.xxx', gen(4) result 5018/0.

We initially thought the issue was that the [NT AUTHORITY\SYSTEM] login needed the ALTER ANY AVAILABILITY GROUP permission, but after correcting that oversight, the issue persists.
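For reference, the grants we applied were along these lines. (This is the permission set Microsoft documents for the [NT AUTHORITY\SYSTEM] login, which AG health detection runs under; we applied it on both replicas.)

```sql
-- Run on each replica (P1 and P3); these are the documented minimums
-- for the local system account used by AG health detection.
GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM];
GRANT CONNECT SQL TO [NT AUTHORITY\SYSTEM];
GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM];
```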

We cleared the ARP cache, both from the command line and by rebooting P3. We also granted the cluster name object in AD Domain Admins membership (i.e., full control over all objects in AD); no change. P1 and P3 are identical with regard to AD: same everything, same OU, same permissions, same people with rights to them.
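For completeness, the cache flush and duplicate checks we ran on P3 were roughly the following (listener IP obfuscated as in the logs above):

```powershell
# Flush the ARP/neighbor cache (run elevated)
arp -d *

# With the AG IP resource offline, nothing should answer on the listener IP;
# a reply here would indicate a genuine duplicate somewhere on the network.
ping xx.xx.xx.xxx
arp -a | findstr "xx.xx.xx.xxx"
Get-NetNeighbor | Where-Object IPAddress -eq "xx.xx.xx.xxx"
```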

We have tried engaging with Microsoft, but it has been VERY slow going in getting any responses. Their one suggestion to date (likely from a Tier 1 support person rather than a subject matter expert) has been to

increase the Maximum Failures in the Specified Period value in Failover Cluster Manager

which seems like a very unlikely fix (we've yet to schedule another maintenance window to try this, and frankly we would like to go into the next attempt with more ideas than just that).

We have noted some discrepancies in setup between the new EC2 instance and the legacy one, and we are wondering if the "answer" might lie somewhere in here --- we are running out of other ideas:

In SQL Server Configuration Manager:

  1. There is an interesting discrepancy in which IPs SQL Server itself is "aware" of on each server (in terms of TCP/IP entries in SQL Server Configuration Manager). SQL Server on P1 is not aware of all the IPs on that server; notably, it is missing the IPs assigned to P1 for the OLTP and OLAP listeners. Conversely, SQL Server on P3 (the new server) is aware of ALL the IPs on the server, including the IPs assigned to P3 for the OLTP and OLAP listeners (the very addresses the errors report as "duplicate"). Could these Configuration Manager entries effectively "reserve" those addresses and make them immutable, causing the listener IP resource to fail when the cluster tries to bring it online on P3 during failover?

  2. On P3 (the new server), the "Link-local IPv6 Address" defined for the server (seen via ipconfig) exists as an IP in the SQL Server Configuration Manager TCP/IP properties, but with both "Active" and "Enabled" set to "No." On P1, by comparison, the "Link-local IPv6 Address" reported by ipconfig is not present in SQL Server Configuration Manager at all.

More generally:

  3. There is a DHCP vs. static discrepancy between P1 and P3 (P1 is set to DHCP, P3 to static). While intuitively one might think "static" to be the right answer, the fact is that P1/P2 functioned fine for years with both nodes on DHCP (with the listener addresses "reserved" at the DNS layer by our Systems team). Could having the cluster/AG nodes configured differently in this regard have adverse implications? Our Systems folks compared the NICs on P1 and P3; other than one being DHCP and one being static, they are the same.
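Before the next maintenance window, we plan to compare the cluster's own view of the two nodes' networking, since the "Unable to find netinterface for node 2 ... status 5035" error reads as though the cluster cannot associate P3's NIC with the network the listener IP lives on. A sketch of what we intend to run (resource name taken from the logs above; adjust as needed):

```powershell
# How does the cluster map each node's NICs to cluster networks?
Get-ClusterNetwork | Format-Table Name, Role, Address, AddressMask, State
Get-ClusterNetworkInterface | Format-Table Node, Network, Name, Address, State

# Which network and address is the failing IP resource bound to?
Get-ClusterResource "SQLOLAP-AG_xx.xx.xx.xxx" | Get-ClusterParameter

# Are both nodes listed as possible owners of the IP resource?
Get-ClusterResource "SQLOLAP-AG_xx.xx.xx.xxx" | Get-ClusterOwnerNode
```

If P3's interface does not appear against the expected cluster network in the second listing, that mismatch (rather than a true duplicate IP) might explain the online failure.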

Thanks
