Is this zero-downtime database migration plan viable?

471 Views Asked by At

I am pondering about performing a zero-downtime database migration and came up with minimum necessary steps.

By "migration" I mean any change in the same database that is not backward-compatible such as renaming, splitting or dropping a column.

Since I couldn't find much information elsewhere, I would like to validate my reasoning with someone having hands-on experience with this. Let's imagine we have a capability to perform rolling deployments, otherwise I don't believe that zero downtime DB migration is possible. So:

  1. Initial state: V1 is deployed in prod. It uses table1.oldColumn
  2. Goal: rename table1.oldColumn to table1.newColumn with zero downtime

Steps:

  1. Create table1.newColumn: ALTER TABLE table1 ADD COLUMN newColumn(...)

  2. Gradually deploy V2. The V2 code contains the following changes:

    • SELECTs use oldColumn: SELECT oldColumn FROM table1 WHERE userId = 1001. That's because only oldColumn contains full data for now while newColumn contains only a subset of it
    • UPDATEs use both, but when a new value is missing in newColumn, it's copied from oldColumn. If we don't do that, we will chase constantly changing oldColumn forever
    • INSERTs use both columns: INSERT INTO table1 (oldColumn, newColumn) VALUES ('abcd', 'abcd')
    • DELETEs are usually irrelevant because the delete remove the entire row: DELETE FROM table1 WHERE userId = 1001
      • However, if the column is a UNIQUE KEY, then the oldColumn is used: DELETE FROM table1 WHERE oldColumn = 'xyz'
  3. Now that all new data is always in sync, we still have a diff between oldColumn and newColumn. In order to liquidate difference between oldColumn and newColumn, we run a background script copying values missing in newColumn from oldColumn

  4. Now that columns are in sync, gradually deploy V3. V3 code contains the following changes: SELECTs, UPDATEs, INSERTs and DELETEs go to newColumn now. table1.oldColumn is not used anymore

  5. Drop the unused table1.oldColumn: ALTER table1 DROP COLUMN oldColumn

Note: steps 3 and 5 can be performed as part of the database migration during V2 and V3 startup

Recap:

  1. Initially newColumn is empty and all data goes to oldColumn
  2. While we gradually replace V1 with V2, data starts to flow into oldColumn alongside newColumn. At this point some data still flows into oldColumn only (because we are performing a rolling update so not all instances are V2)
  3. As soon as V2 is deployed, data flows in both oldColumn and newColumn. We mirror updates and inserts to keep columns in sync
  4. However, some data was inserted into oldColumn before newColumn was devised and some data got there from remaining V1 instances that existed during the rolling update. We must get rid of this difference
  5. When the script is run, data in oldColumn missing in newColumn gets copied there
1

There are 1 best solutions below

0
On

your use of terms is a bit confusing as what you are describing is not "migration" as the term is normally used. Also, it is not clear what your requirements are that you've described as needing zero downtime. Downtime means making something unavailable for a period of time; you can add/drop a column from a table without making that table unavailable to users so the change requires zero downtime - but obviously any query that referenced a dropped column will no longer work.

If you want to change your DB structure without breaking anything then either you need control over everything that accesses the DB (which is unlikely to be possible) and you can deploy the DB change and everything affected by it in one go - or you can protect users from changes by using views that hide the database implementation from them and only allowing the users to access the views.

If you make changes that are so fundamental that they cannot be hidden in a view definition change then you probably have no choice but to communicate this change to your users and they will all need to go through a proper SDLC to determine if the change will affect them and to update their code if it does