We have a 9-node Ceph cluster running Ceph version 15.2.5. The cluster has 175 HDD OSDs plus 3 NVMe OSDs used as a cache tier for the "cephfs_data" pool. CephFS pool info:
POOL             ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
cephfs_data       1  350 TiB  179.53M  350 TiB  66.93     87 TiB
cephfs_metadata   3  3.1 TiB   17.69M  3.1 TiB   1.77     87 TiB
We use multiple active MDS instances: 3 active and 3 standby. Each MDS server has 128 GB of RAM, with mds_cache_memory_limit = 64 GB.
Failover to a standby MDS instance takes 10-15 hours! CephFS is unreachable to clients for that entire time, while the MDS instance sits in the "up:replay" state. It looks as if the MDS daemon is walking every directory during this step, and we have millions of folders containing millions of small files. Once the folder/subfolder scan finishes, CephFS becomes active again. I believe 10 hours of downtime during an MDS failover is unexpected behaviour. Is there any way to force the MDS to transition to active and run the required directory checks in the background? How can I localise the root cause?
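One way to narrow down what up:replay is doing is to watch the MDS journal counters while the replay runs. The commands below are a sketch; the MDS daemon name (mds.node1 here) is a placeholder for your actual instance, and they must be run on the host where that daemon's admin socket lives:

```
# Overall FS and rank state (shows which rank is stuck in up:replay)
ceph fs status

# MDS journal counters: watch "num_segments" and "num_events" under the
# "mds_log" section -- if num_segments is huge, replay has a huge journal
# to chew through, which explains a multi-hour up:replay.
ceph daemon mds.node1 perf dump mds_log
```

If num_segments is in the tens of thousands rather than the low hundreds, the journal itself (not a directory scan) is what replay is spending its time on.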
P.S.: We tried standby-replay; it helps, but it doesn't eliminate the root cause.
The root cause is mds_log_max_segments = 100000. With that many segments the MDS journal grows enormous, and a takeover MDS must replay the entire journal in up:replay before the rank can go active, which is what takes 10+ hours. The value should be well below 1000 (the default is 128).
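A minimal sketch of applying the fix on a Ceph Octopus (15.x) cluster via the centralized config store; values other than the setting name are assumptions to adapt to your environment:

```
# Restore mds_log_max_segments to a sane value (128 is the shipped default)
ceph config set mds mds_log_max_segments 128

# Verify the running daemons picked it up (mds.node1 is a placeholder name)
ceph config show mds.node1 | grep mds_log_max_segments
```

After lowering the limit, the MDSes will gradually trim the journal back down; expect failover time to shrink as num_segments falls, rather than improving instantly.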