Ceph PGs not deep scrubbed in time keep increasing

I noticed this about 4 days ago and don't know what to do right now. The problem is as follows:

I have a 6-node, 3-monitor Ceph cluster with 84 OSDs: 72x 7200 rpm spinning disks and 12x NVMe SSDs for journaling. All scrub-related settings are at their default values. Every PG in the cluster is active+clean and every cluster stat is green, yet the number of PGs not deep-scrubbed in time keeps increasing and is at 96 right now. Output from ceph -s:

  cluster:
    id:     xxxxxxxxxxxxxxxxx
    health: HEALTH_WARN
            1 large omap objects
            96 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 6h)
    mgr: mon2(active, since 2w), standbys: mon1
    mds: cephfs:1 {0=mon2=up:active} 2 up:standby
    osd: 84 osds: 84 up (since 4d), 84 in (since 3M)
    rgw: 3 daemons active (mon1, mon2, mon3)

  data:
    pools:   12 pools, 2006 pgs
    objects: 151.89M objects, 218 TiB
    usage:   479 TiB used, 340 TiB / 818 TiB avail
    pgs:     2006 active+clean

  io:
    client:   1.3 MiB/s rd, 14 MiB/s wr, 93 op/s rd, 259 op/s wr

How do I solve this problem? The ceph health detail output also shows that these non-deep-scrubbed PG alerts started on January 25th, but I didn't notice them before. I only noticed when an OSD went down for 30 seconds and came back up. Might it be related to this issue? Will it just resolve itself? Should I tamper with the scrub configuration? For example, how much client-side performance loss might I face if I increase osd_max_scrubs from 1 to 2?

There are 5 answers below.

Accepted answer

Usually the cluster deep-scrubs itself during low-I/O intervals. By default, every PG has to be deep-scrubbed once a week. If OSDs go down they can't be deep-scrubbed, which of course can cause some delay. You could run something like this to see which PGs are behind and whether they're all on the same OSD(s):

ceph pg dump pgs | awk '{print $1" "$23}' | column -t

Sort the output if necessary, and you can issue a manual deep-scrub on one of the affected PGs to see if the number decreases and if the deep-scrub itself works.

ceph pg deep-scrub <PG_ID>

Also, please add the output of ceph osd pool ls detail to see if any flags are set.
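
If you just want to check for scrub-blocking flags yourself first, something like this should be enough (assuming the usual noscrub/nodeep-scrub flag names):

ceph osd dump | grep flags
ceph osd pool ls detail | grep -E 'noscrub|nodeep-scrub'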

Answer

Ceph will not scrub during recovery by default. You can tell all OSDs that it's OK to do it for now (this lasts until restart):

ceph tell 'osd.*' injectargs '--osd_scrub_during_recovery=1'
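
If you are on Nautilus or later and would rather have the change survive restarts, the central config store is an alternative to injectargs (verify the option name with ceph config help osd_scrub_during_recovery on your release):

ceph config set osd osd_scrub_during_recovery true
# once recovery has finished:
ceph config rm osd osd_scrub_during_recovery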

Answer

You can set the deep scrub interval to 2 weeks to stretch the deep scrub window. Instead of

 osd_deep_scrub_interval = 604800

use:

 osd_deep_scrub_interval = 1209600
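
To apply this at runtime rather than by editing ceph.conf and restarting, the central config store should work on recent releases. Note that the monitor/mgr side also derives the "not deep-scrubbed in time" warning threshold from this value, so setting it globally is probably the safer choice (treat that detail as release-dependent):

ceph config set global osd_deep_scrub_interval 1209600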

Mr. Eblock has a good idea: manually force a deep scrub on some of the PGs to spread the work evenly within the 2 weeks.

Answer

You have 2 options:

  1. Increase the interval between deep scrubs.
  2. Control deep scrubbing manually with a standalone script.

I've written a simple PHP script which takes care of deep scrubbing for me: https://gist.github.com/ethaniel/5db696d9c78516308b235b0cb904e4ad

It lists all the PGs, picks one PG whose last deep scrub finished more than 2 weeks ago (the script takes the oldest one), checks that the OSDs the PG sits on are not being used for another scrub (i.e. they are in active+clean state), and only then starts a deep scrub on that PG. Otherwise it goes looking for another PG.
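
A rough shell sketch of that logic (this is not the linked PHP script; the field numbers follow the accepted answer's assumption that $23 is DEEP_SCRUB_STAMP and $12 is STATE, so verify them against your own ceph pg dump pgs output):

# two-week cutoff; ISO timestamps compare correctly as strings
cutoff=$(date -d '14 days ago' '+%Y-%m-%dT%H:%M:%S')
# oldest active+clean PG whose last deep scrub is older than the cutoff
pg=$(ceph pg dump pgs 2>/dev/null \
      | awk -v c="$cutoff" '$12 == "active+clean" && $23 < c {print $23" "$1}' \
      | sort | head -n 1 | awk '{print $2}')
# only issue the deep scrub if a candidate was found
[ -n "$pg" ] && ceph pg deep-scrub "$pg"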

I have osd_max_scrubs set to 1 (otherwise OSD daemons start crashing due to a bug in Ceph), so this script works nicely with the regular scheduler: whichever starts the scrubbing on a PG-OSD first wins.
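
To see how many deep scrubs are actually in flight while you experiment with this, counting the scrubbing+deep state should be enough (the state string is typically active+clean+scrubbing+deep, but check your own output):

ceph pg dump pgs 2>/dev/null | grep -c 'scrubbing+deep'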

Answer

In addition to the other answers, here is a command line to start a deep scrub on the late PGs:

ceph health detail | grep 'not deep-scrubbed since' | awk '{ print $2 }' | xargs -n1 ceph pg deep-scrub