pt-table-checksum not detecting diffs

802 Views Asked by At

I have a simple master->slave setup with MariaDB:

Master: Ubuntu 16.04 LTS with MariaDB 10.2.8 and percona-toolkit 3.0.4

Slave: Ubuntu 16.04 LTS with MariaDB 10.2.7

Replication is running fine and now I want to check if data is identical between master and slave.

I installed percona-toolkit on the master and created a checksum user:

MariaDB> GRANT REPLICATION SLAVE,PROCESS,SUPER, SELECT ON *.* TO `pt_checksum`@'%' IDENTIFIED BY 'password';
MariaDB> GRANT ALL PRIVILEGES ON percona.* TO `pt_checksum`@'%';
MariaDB> FLUSH PRIVILEGES;

I also added report_host in the slave conf, so that it presents itself to the master:

MariaDB [(none)]> show slave hosts;
+-----------+-----------+------+-----------+
| Server_id | Host      | Port | Master_id |
+-----------+-----------+------+-----------+
|         2 | 10.0.0.49 | 3306 |         1 |
+-----------+-----------+------+-----------+
1 row in set (0.00 sec)

To test pt-table-checksum I removed a row from the Tickets table in my test database on the slave. I have verified that this row is indeed missing but still present on the master.

But pt-table-checksum does not report this difference:

# pt-table-checksum --databases=shop_test --tables=Tickets --host=localhost --user=pt_checksum --password=... --no-check-binlog-format --no-check-replication-filters
        TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
09-07T16:15:02      0      0       14       1       0   0.013 shop_test.Tickets

So I set PTDEBUG=1 in my environment, but it seems that the master connects fine to the slave. I tried to pick out the relevant bits from the output:

# MasterSlave:5175 9725 Connected to h=localhost,p=...,u=pt_checksum
# MasterSlave:5184 9725 SELECT @@SERVER_ID
# MasterSlave:5186 9725 Working on server ID 1
# MasterSlave:5219 9725 Looking for slaves on h=localhost,p=...,u=pt_checksum using methods processlist hosts
# MasterSlave:5226 9725 Finding slaves with _find_slaves_by_processlist
# MasterSlave:5288 9725 DBI::db=HASH(0x31c5190) SHOW GRANTS FOR CURRENT_USER()
# MasterSlave:5318 9725 DBI::db=HASH(0x31c5190) SHOW FULL PROCESSLIST
# DSNParser:1417 9725 Parsing h=10.0.0.49
[...]
# MasterSlave:5231 9725 Found 1 slaves
# MasterSlave:5208 9725 Recursing from h=localhost,p=...,u=pt_checksum to h=10.0.0.49,p=...,u=pt_checksum
# MasterSlave:5155 9725 Recursion methods: processlist hosts
[...]
# MasterSlave:5175 9725 Connected to h=10.0.0.49,p=...,u=pt_checksum
# MasterSlave:5184 9725 SELECT @@SERVER_ID
# MasterSlave:5186 9725 Working on server ID 2
# MasterSlave:5097 9725 Found slave: h=10.0.0.49,p=...,u=pt_checksum
[...]
# pt_table_checksum:9793 9725 Exit status 0 oktorun 1
# Cxn:3764 9725 Destroying cxn
# Cxn:3774 9725 DBI::db=HASH(0x31cd218) Disconnecting dbh on slaveserver h=10.0.0.49
# Cxn:3764 9725 Destroying cxn
# Cxn:3774 9725 DBI::db=HASH(0x31c5190) Disconnecting dbh on masterserver h=localhost

I am out of ideas, as why the missing row is not detected?

1

There are 1 best solutions below

0
On BEST ANSWER

I noticed a new bug-report during the weekend, and I have confirmed today that this is indeed the problem I am experiencing.

The workaround is to add --set-vars binlog_format=statement.

When I set this option, the difference reveals itself after the second run.

During the first run the checksum table on the slave changes from:

MariaDB [percona]> select tbl, this_crc, this_cnt, master_crc,master_cnt from checksums where tbl = 'Tickets' and db = 'shop_test';
+---------+----------+----------+------------+------------+
| tbl     | this_crc | this_cnt | master_crc | master_cnt |
+---------+----------+----------+------------+------------+
| Tickets | f30abebe |       14 | f30abebe   |         14 |
+---------+----------+----------+------------+------------+

...to...

MariaDB [percona]> select tbl, this_crc, this_cnt, master_crc,master_cnt from checksums where tbl = 'Tickets' and db = 'shop_test';
+---------+----------+----------+------------+------------+
| tbl     | this_crc | this_cnt | master_crc | master_cnt |
+---------+----------+----------+------------+------------+
| Tickets | 284ec207 |       13 | f30abebe   |         14 |
+---------+----------+----------+------------+------------+

And after the second run, the diff is also present in pt-checksum-table output:

# pt-table-checksum --tables=shop_test.Tickets --host=localhost --user=pt_checksum --password=... --no-check-binlog-format --no-check-replication-filters --set-vars binlog_format=statement
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
09-11T11:17:37      0      1       14       1       0   0.022 shop_test.Tickets

I checked with SHOW VARIABLES LIKE 'binlog_format' that binlog_format is still 'MIXED', so apparently it only changes for the duration of the session. According to the doc's, this should happen automatically, as far as I understand:

This works only with statement-based replication (pt-table-checksum will switch the binlog format to STATEMENT for the duration of the session if your server uses row-based replication).

Bug report: https://jira.percona.com/browse/PT-1443

[UPDATE] Issue is still unresolved in september 2020