How to rejoin a mon and mgr to a Ceph cluster


I have this situation and can't access the Ceph dashboard. I had 5 mons, but 2 of them went down, and one of those is the bootstrap mon node, which also runs the mgr. This is what I get from that node:

    2020-10-14T18:59:46.904+0330 7f9d2e8e9700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
  cluster:
    id:     e97c1944-e132-11ea-9bdd-e83935b1c392
    health: HEALTH_WARN
            no active mgr

  services:
    mon: 3 daemons, quorum srv4,srv5,srv6 (age 2d)
    mgr: no daemons active (since 2d)
    mds: heyatfs:1 {0=heyfs.srv10.lxizhc=up:active} 1 up:standby
    osd: 54 osds: 54 up (since 47h), 54 in (since 3w)

  task status:
    scrub status:
        mds.heyfs.srv10.lxizhc: idle

  data:
    pools:   3 pools, 65 pgs
    objects: 223.95k objects, 386 GiB
    usage:   1.2 TiB used, 97 TiB / 98 TiB avail
    pgs:     65 active+clean

  io:
    client:   105 KiB/s rd, 328 KiB/s wr, 0 op/s rd, 0 op/s wr
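(For anyone hitting the same `handle_auth_bad_method ... allowed_methods [2]` line: it means the mons only accept cephx (method 2) and the client's authentication failed, which usually points at a missing or stale keyring rather than a protocol mismatch. A hedged way to check, assuming admin access on a host that can still reach the quorum:)

    # Inspect the cephx settings the cluster expects (all default to "cephx"):
    ceph config get mon auth_service_required
    ceph config get mon auth_cluster_required
    ceph config get mon auth_client_required

    # Confirm the client keyring exists and explicitly point ceph at it:
    ls -l /etc/ceph/ceph.client.admin.keyring
    ceph -s --keyring /etc/ceph/ceph.client.admin.keyring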

I should tell the whole story. I used cephadm to create the cluster, and I'm new to Ceph. I have 15 servers: 14 of them run an OSD container, 5 of them had a mon, and my bootstrap mon node, srv2, runs the mgr. Two of these servers have public IPs, and I use one of them as a client (I know this structure raises a lot of questions, but my company forces me to do it, and since I'm new to Ceph, that's how it is for now). Two weeks ago I lost 2 OSDs, and I asked the datacenter that provides these servers to replace those 2 HDDs. They restarted the servers, and unfortunately those servers were my mon servers. After the restart, one of them, srv5, came back, but I could see srv3 was out of quorum, so I began troubleshooting. Inside `cephadm shell --fsid ...` I ran:

ceph orch apply mon srv3
ceph mon remove srv3

After a while I saw in my dashboard that srv2, my bootstrap mon, and its mgr were not working. When I ran `ceph -s`, srv2 wasn't listed, and I can see the srv2 mon in the `removed` directory:

root@srv2:/var/lib/ceph/e97c1944-e132-11ea-9bdd-e83935b1c392# ls
crash  crash.srv2  home  mgr.srv2.xpntaf  osd.0  osd.1  osd.2  osd.3  removed

But `mgr.srv2.xpntaf` is running, and unfortunately I have now lost access to the Ceph dashboard.

I tried to add srv2 and srv3 back to the monmap with:

    ceph orch daemon add mon srv2:172.32.X.3
    ceph mon dump
    ceph mon add srv3 172.32.X.4:6789

and now:

root@srv2:/# ceph -s
  cluster:
    id:     e97c1944-e132-11ea-9bdd-e83935b1c392
    health: HEALTH_WARN
            no active mgr
            2/5 mons down, quorum srv4,srv5,srv6

  services:
    mon: 5 daemons, quorum srv4,srv5,srv6 (age 16h), out of quorum: srv2, srv3
    mgr: no daemons active (since 2d)
    mds: heyatfs:1 {0=heyatfs.srv10.lxizhc=up:active} 1 up:standby
    osd: 54 osds: 54 up (since 2d), 54 in (since 3w)

  task status:
    scrub status:
        mds.heyatfs.srv10.lxizhc: idle

  data:
    pools:   3 pools, 65 pgs
    objects: 223.95k objects, 386 GiB
    usage:   1.2 TiB used, 97 TiB / 98 TiB avail
    pgs:     65 active+clean

  io:
    client:   105 KiB/s rd, 328 KiB/s wr, 0 op/s rd, 0 op/s wr

I must also say that `ceph orch host ls` doesn't work; it hangs when I run it, and I think that's because of the "no active mgr" error. In the `removed` directory, `mon.srv2` is still there and has a `unit.run` file, so I used that to run the container again, but it says mon.srv2 isn't in the monmap and doesn't have a specific IP. By the way, after `ceph orch apply mon srv3` I could see a new container with a new fsid on the srv3 server.
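(A rough sketch of the manual recovery path for a mon that has dropped out of the monmap, assuming you run this against the surviving quorum and inside the mon's container on srv2; exact paths and daemon names may differ on a cephadm cluster:)

    # 1. Fetch the current monmap from the surviving quorum and inspect it.
    ceph mon getmap -o /tmp/monmap
    monmaptool --print /tmp/monmap

    # 2. If srv2 is missing, add it with its mon IP (IP assumed from this post).
    monmaptool --add srv2 172.32.X.3:6789 /tmp/monmap

    # 3. With the srv2 mon daemon stopped, inject the map into its store.
    ceph-mon -i srv2 --inject-monmap /tmp/monmap

    # 4. Start the daemon again and watch it rejoin quorum.
    ceph mon stat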

I now know my whole problem is that I ran `ceph orch apply mon srv3`, because the installation documentation says:

To deploy monitors on a specific set of hosts:

# ceph orch apply mon *<host1,host2,host3,...>*

Be sure to include the first (bootstrap) host in this list.

and I didn't see that line!
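(For reference, a hedged sketch of what the placement probably should have been, using the five mon hostnames from this cluster, so that cephadm keeps managing all of them including the bootstrap host:)

    # Re-declare the full mon placement, including the bootstrap host srv2;
    # cephadm reconciles the running mon daemons to match this spec.
    ceph orch apply mon "srv2,srv3,srv4,srv5,srv6"

    # Verify the service spec and which mon daemons cephadm now manages:
    ceph orch ls mon
    ceph orch ps --daemon-type mon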

Now I have managed to get another mgr running, but I get this:

root@srv2:/var/lib/ceph/mgr# ceph -s
2020-10-15T13:11:59.080+0000 7f957e9cd700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
  cluster:
    id:     e97c1944-e132-11ea-9bdd-e83935b1c392
    health: HEALTH_ERR
            1 stray daemons(s) not managed by cephadm
            2 mgr modules have failed
            2/5 mons down, quorum srv4,srv5,srv6

  services:
    mon: 5 daemons, quorum srv4,srv5,srv6 (age 20h), out of quorum: srv2, srv3
    mgr: srv4(active, since 8m)
    mds: heyatfs:1 {0=heyatfs.srv10.lxizhc=up:active} 1 up:standby
    osd: 54 osds: 54 up (since 2d), 54 in (since 3w)

  task status:
    scrub status:
        mds.heyatfs.srv10.lxizhc: idle

  data:
    pools:   3 pools, 65 pgs
    objects: 301.77k objects, 537 GiB
    usage:   1.6 TiB used, 97 TiB / 98 TiB avail
    pgs:     65 active+clean

  io:
    client:   180 KiB/s rd, 597 B/s wr, 0 op/s rd, 0 op/s wr

And when I run `ceph orch host ls`, I see this:

root@srv2:/var/lib/ceph/mgr# ceph orch host ls
HOST   ADDR          LABELS  STATUS
srv10  172.32.x.11
srv11  172.32.x.12
srv12  172.32.x.13
srv13  172.32.x.14
srv14  172.32.x.15
srv15  172.32.x.16
srv2   srv2
srv3   172.32.x.4
srv4   172.32.x.5
srv5   172.32.x.6
srv6   172.32.x.7
srv7   172.32.x.8
srv8   172.32.x.9
srv9   172.32.x.10
