Ceph MDS stays in "up:replay" for hours. MDS failover takes 10-15 hours


We have a 9-node Ceph cluster running version 15.2.5. The cluster has 175 OSDs (HDD) plus 3 NVMe devices used as a cache tier for the "cephfs_data" pool. CephFS pool info:

POOL                    ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
cephfs_data              1  350 TiB  179.53M  350 TiB  66.93     87 TiB
cephfs_metadata          3  3.1 TiB   17.69M  3.1 TiB   1.77     87 TiB

We use multiple active MDS instances: 3 "active" and 3 "standby". Each MDS server has 128GB RAM, "mds cache memory limit" = 64GB.

Failover to a standby MDS instance takes 10-15 hours! CephFS is unreachable for the clients the whole time, and the MDS instance just stays in the "up:replay" state. It looks like the MDS daemon is checking all of the folders during this step. We have millions of folders with millions of small files. When the folder/subfolder scan is done, CephFS becomes active again. I believe 10 hours of downtime during an MDS failover is unexpected behaviour. Is there any way to force the MDS to switch to active and run all of the required directory checks in the background? How can I localise the root cause?

P.S.: we tried standby-replay and it helps but doesn't eliminate the root cause.


There is 1 best solution below


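The root cause is mds_log_max_segments = 100000. This option caps how many journal segments the MDS keeps before trimming, so with a limit that high the journal can grow very large, and a standby MDS has to replay all of it in "up:replay" before it can become active. The value should be smaller than 1000.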
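
For anyone hitting the same thing, here is a rough sketch of how the setting could be checked and lowered with the centralized config commands. The target of 128 is my own assumption (as far as I know it is the upstream default); the only requirement above is a value below 1000. The guard just skips the commands on a host without the ceph CLI.

```shell
# Assumed target value (not from the original post); anything below 1000 fits the advice above.
TARGET_SEGMENTS=128

if command -v ceph >/dev/null 2>&1; then
  # Show the value currently stored in the cluster config database for MDS daemons.
  ceph config get mds mds_log_max_segments

  # Lower the limit for all MDS daemons.
  ceph config set mds mds_log_max_segments "$TARGET_SEGMENTS"

  # Check MDS ranks and states afterwards.
  ceph fs status
fi
```

If the option was set in ceph.conf on the MDS hosts rather than in the config database, it would need to be changed there as well.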