We have a MySQL InnoDB cluster in one of our environments. One of the nodes in the cluster crashed. Although we were able to bring the crashed node back online, we were unable to rejoin it to the cluster.
Can someone please help us recover/restore the node and rejoin it to the cluster? We tried "dba.rebootClusterFromCompleteOutage()", but it didn't help.
Configuration: MySQL 5.7.24 Community Edition, CentOS 7, standard three-node InnoDB cluster
Cluster Status:
MySQL NODE02:3306 ssl JS > var c=dba.getCluster()
MySQL NODE02:3306 ssl JS > c.status()
{
    "clusterName": "QACluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "NODE03:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
        "topology": {
            "NODE02:3306": {
                "address": "NODE02:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE03:3306": {
                "address": "NODE03:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE01:3306": {
                "address": "NODE01:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            }
        }
    },
    "groupInformationSourceMember": "mysql://clusterAdmin@NODE03:3306"
}
Errors logged in the MySQL error log:
2019-03-04T23:49:36.970839Z 3624 [Note] Slave SQL thread for channel 'group_replication_recovery' initialized, starting replication in log 'FIRST' at position 0, relay log './NODE01-relay-bin-group_replication_recovery.000001' position: 4
2019-03-04T23:49:36.985336Z 3623 [Note] Slave I/O thread for channel 'group_replication_recovery': connected to master 'mysql_innodb_cluster_r0429584112@NODE02:3306',replication started in log 'FIRST' at position 4
2019-03-04T23:49:36.988164Z 3623 [ERROR] Error reading packet from server for channel 'group_replication_recovery': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)
2019-03-04T23:49:36.988213Z 3623 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236
2019-03-04T23:49:36.988226Z 3623 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'FIRST', position 4
2019-03-04T23:49:36.988286Z 41 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-03-04T23:49:36.988358Z 3624 [Note] Error reading relay log event for channel 'group_replication_recovery': slave SQL thread was killed
2019-03-04T23:49:36.988435Z 3624 [Note] Slave SQL thread for channel 'group_replication_recovery' exiting, replication stopped in log 'FIRST' at position 0
2019-03-04T23:49:37.016864Z 41 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='NODE02', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-03-04T23:49:37.030769Z 41 [ERROR] Plugin group_replication reported: 'Maximum number of retries when trying to connect to a donor reached. Aborting group replication recovery.'
2019-03-04T23:49:37.030798Z 41 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-03-04T23:49:37.051169Z 41 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-03-04T23:49:37.069184Z 41 [ERROR] Plugin group_replication reported: 'Fatal error during the Recovery process of Group Replication. The server will leave the group.'
2019-03-04T23:49:37.069304Z 41 [Note] Plugin group_replication reported: 'Going to wait for view modification'
2019-03-04T23:49:40.336938Z 0 [Note] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
I did the following to restore the failed node from backup and was able to recover the cluster state.
1) When NODE01 failed, the cluster status was the one shown above, with NODE01 reported as (MISSING).
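Before restoring, the 1236 error above can be confirmed by comparing GTID sets: if the healthy members have already purged transactions that the failed node never applied, incremental recovery is impossible and a full restore is required. A minimal check (the angle-bracket values are placeholders to fill in by hand):
-- On the failed node (NODE01): the transactions it has already applied
SELECT @@GLOBAL.gtid_executed;
-- On the healthy members (NODE02/NODE03): the transactions already purged from their binary logs
SELECT @@GLOBAL.gtid_purged;
-- If the donor's purged set is not contained in NODE01's executed set, the donor can no longer
-- serve the missing transactions, which matches error 1236. This returns 0 in that case:
SELECT GTID_SUBSET('<donor gtid_purged>', '<NODE01 gtid_executed>') AS incremental_recovery_possible;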
2) Take a mysqldump from the primary (healthy) node, NODE03 in this case, using the following command.
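A sketch of the dump command, assuming the clusterAdmin account has the required privileges; /backup/qacluster_full.sql is a placeholder path:
# Run against the primary (NODE03). --single-transaction gives a consistent InnoDB snapshot,
# and --set-gtid-purged=ON writes the primary's GTID set into the dump so the restored node
# ends up with a matching gtid_executed after the restore.
mysqldump -u clusterAdmin -p -h NODE03 \
  --all-databases --triggers --routines --events \
  --single-transaction --set-gtid-purged=ON > /backup/qacluster_full.sql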
3) Remove the failed node from the cluster, as shown below.
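For example, from MySQL Shell connected to one of the healthy nodes; the force option (available in recent MySQL Shell releases) is assumed here because NODE01 is no longer reachable as a group member:
MySQL NODE03:3306 ssl JS > var c=dba.getCluster()
MySQL NODE03:3306 ssl JS > c.removeInstance('clusterAdmin@NODE01:3306', {force: true})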
4) Stop Group Replication on the failed node if it is still running.
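On NODE01, in a SQL session:
-- Check whether the node still thinks it is part of a group, then stop Group Replication
SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;
STOP GROUP_REPLICATION;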
5) Reset "gtid_executed" on the failed node.
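On NODE01; RESET MASTER clears gtid_executed (and removes the node's binary logs), which is required because the dump's SET @@GLOBAL.gtid_purged statement can only be applied while the GTID state is empty:
-- Clear gtid_executed and the binary logs on the failed node
RESET MASTER;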
6) Disable super_read_only on the failed node.
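On NODE01, so that the restore session can write to the instance:
-- Allow writes from the privileged session used for the restore
SET GLOBAL super_read_only = 0;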
7) Restore the mysqldump taken from the primary onto the failed node.
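A sketch of the restore, assuming the dump file from step 2 has been copied to NODE01:
# Run on NODE01 with a privileged account; besides the data, the dump also restores
# the GTID set via its SET @@GLOBAL.gtid_purged statement
mysql -u root -p < /backup/qacluster_full.sql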
8) Once the restore is complete, re-enable super_read_only on the failed node.
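On NODE01:
-- Put the node back into read-only mode before it rejoins the group
SET GLOBAL super_read_only = 1;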
9) Finally, add the failed node back to the InnoDB cluster.
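From MySQL Shell connected to one of the healthy nodes, for example:
MySQL NODE03:3306 ssl JS > var c=dba.getCluster()
MySQL NODE03:3306 ssl JS > c.addInstance('clusterAdmin@NODE01:3306')
MySQL NODE03:3306 ssl JS > c.status()
After addInstance() completes, c.status() should report NODE01 as ONLINE again and the overall cluster status should return to OK.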