CF BOSH Director Multi-DC High Availability


I'm looking at a new architecture for deploying Cloud Foundry using multi-CPI with a single BOSH Director. Suppose the BOSH Director is deployed in DC-A and manages two vCenters, one in DC-A and the other in DC-B. If DC-A goes offline, what are the options for running BOSH active/standby in DC-B, so that it can immediately take over deployments without having to perform a backup and restore?


Yes, multi-DC BOSH deployments with a multi-CPI BOSH Director work great! And your question comes up very often when people think about such a multi-DC design.

There is no high availability (HA) for a BOSH Director, and there is no active/passive setup that I'm aware of right now. The reason is that losing a BOSH Director is not a big deal. The nodes that this Director manages will keep running on top of the infrastructure. They just won't be “manageable” until you bring your Director back.

But if we think about the requirements for such an active/passive setup, here is what I would come up with:

  1. They would have to share the exact same CPI installation and setup. Not a big deal.

  2. They would have to share the same SQL database and same blobstore (object storage). This is not a big deal, but this leads to using both external SQL storage and external blobstore. Then the “passive” BOSH Director would at least have to disable its resurrector plugin, in order not to compete with the resurrector from the “active” BOSH Director. (In fact, the passive BOSH Director would have to be completely stopped, see below.)

  3. They would have to share the same NATS message bus, which is usually co-located on the BOSH Director and thus dedicated to it. It's easy to extract NATS from the Director and run it separately in a high-availability setup. But then the problem becomes: which Director consumes the NATS messages? Two Directors cannot compete in consuming those messages. That's why the “passive” BOSH Director would need its processes stopped with monit stop, or the whole instance stopped with bosh stop.

  4. This bosh stop requirement cannot be met with the bosh CLI alone (v2, which includes the bosh-init component that acts like a local, stripped-down BOSH Director). So these two BOSH Directors would have to be deployed by a “bootstrap” BOSH Director. This is quite common: I've even seen up to four stages of this bootstrap Director pattern in some customer production environments!

  5. Now imagine you have it all: a bootstrap BOSH Director that deploys a separate HA NATS and two Directors, with the same CPI setup, the same external SQL database, and the same external blobstore. Then it would work! Whenever you lose the active one, you bosh start the passive one and it takes over. But then you should be careful that the previously active one doesn't suddenly pop back up, or it would compete in consuming NATS messages and possibly mess up the database and blobstore. That's where BOSH is missing some “lock” feature that allows only one active Director at a time. Something very simple could be implemented here, based on a database record that designates which Director is active and which one is passive. Switching this record manually would trigger the passive Director to become active.
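To make the “lock” idea from point 5 more concrete, here is a minimal sketch in Python. It is not an existing BOSH feature. The table name director_lock, the function names, and the Director names director-dc-a and director-dc-b are all made up for illustration, and an in-memory SQLite database stands in for the shared external SQL database. The switch is a compare-and-set update, so a Director coming back with a stale view of who is active cannot take the role back by accident.

```python
import sqlite3

# Hypothetical sketch: none of these names exist in BOSH.
# SQLite stands in for the shared external SQL database.


def init_lock(conn, active_name):
    """Create the single-row lock table and record the initial active Director."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS director_lock ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "active_director TEXT NOT NULL)"
    )
    # INSERT OR IGNORE is SQLite syntax; another database would need its own upsert.
    conn.execute(
        "INSERT OR IGNORE INTO director_lock (id, active_director) VALUES (1, ?)",
        (active_name,),
    )
    conn.commit()


def is_active(conn, name):
    """A Director would check this before consuming NATS messages or writing state."""
    row = conn.execute(
        "SELECT active_director FROM director_lock WHERE id = 1"
    ).fetchone()
    return row is not None and row[0] == name


def switch_active(conn, expected_current, new_active):
    """Manual failover: succeeds only if expected_current still holds the lock."""
    cur = conn.execute(
        "UPDATE director_lock SET active_director = ? "
        "WHERE id = 1 AND active_director = ?",
        (new_active, expected_current),
    )
    conn.commit()
    return cur.rowcount == 1


def demo():
    conn = sqlite3.connect(":memory:")
    init_lock(conn, "director-dc-a")
    before = (is_active(conn, "director-dc-a"), is_active(conn, "director-dc-b"))
    # DC-A is lost: an operator switches the record to the Director in DC-B.
    switched = switch_active(conn, "director-dc-a", "director-dc-b")
    after = (is_active(conn, "director-dc-a"), is_active(conn, "director-dc-b"))
    # The old Director returns and still believes it holds the lock: no row matches.
    stale = switch_active(conn, "director-dc-a", "director-dc-a")
    return before, switched, after, stale


if __name__ == "__main__":
    print(demo())
```

In such a design, each Director would call something like is_active at startup and periodically afterwards, and stop its own processes as soon as the answer is no. That periodic check is what would keep a resurfacing former active Director from competing with the new one.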

This is a very good idea for the next Cloud Foundry hackathon!