I am thinking about how to perform a zero-downtime database migration and have come up with what I believe are the minimum necessary steps.
By "migration" I mean any change to the database that is not backward-compatible, such as renaming, splitting or dropping a column.
Since I couldn't find much information elsewhere, I would like to validate my reasoning with someone who has hands-on experience with this. Let's assume we have the capability to perform rolling deployments; otherwise I don't believe a zero-downtime DB migration is possible. So:
- Initial state: V1 is deployed in prod. It uses `table1.oldColumn`
- Goal: rename `table1.oldColumn` to `table1.newColumn` with zero downtime
Steps:

1. Create `table1.newColumn`: `ALTER TABLE table1 ADD COLUMN newColumn (...)`
2. Gradually deploy V2. The V2 code contains the following changes:
   - SELECTs use `oldColumn`: `SELECT oldColumn FROM table1 WHERE userId = 1001`. That's because only `oldColumn` contains full data for now, while `newColumn` contains only a subset of it
   - UPDATEs use both columns, but when a value is missing in `newColumn`, it is copied from `oldColumn`. If we don't do that, we will be chasing the constantly changing `oldColumn` forever
   - INSERTs use both columns: `INSERT INTO table1 (oldColumn, newColumn) VALUES ('abcd', 'abcd')`
   - DELETEs are usually irrelevant because a delete removes the entire row: `DELETE FROM table1 WHERE userId = 1001`. However, if the column is a UNIQUE KEY, then `oldColumn` is used: `DELETE FROM table1 WHERE oldColumn = 'xyz'`
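The V2 read/write rules above can be sketched with Python's `sqlite3` as a stand-in for the production database (the in-memory database, helper names and sample values are illustrative, not part of the question):

```python
import sqlite3

# In-memory stand-in for the production database (illustrative only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE table1 (userId INTEGER PRIMARY KEY, oldColumn TEXT, newColumn TEXT)")

def v2_insert(user_id, value):
    # V2 INSERTs write the same value to both columns.
    db.execute("INSERT INTO table1 (userId, oldColumn, newColumn) VALUES (?, ?, ?)",
               (user_id, value, value))

def v2_update(user_id, value):
    # V2 UPDATEs mirror the write to both columns, so any row touched by V2
    # ends up with newColumn populated even if V1 created it.
    db.execute("UPDATE table1 SET oldColumn = ?, newColumn = ? WHERE userId = ?",
               (value, value, user_id))

def v2_select(user_id):
    # V2 still reads oldColumn, the only column guaranteed to be complete.
    return db.execute("SELECT oldColumn FROM table1 WHERE userId = ?",
                      (user_id,)).fetchone()[0]

# A row inserted earlier by V1 has no newColumn value yet...
db.execute("INSERT INTO table1 (userId, oldColumn) VALUES (1001, 'abcd')")
# ...but any V2 write brings it in sync.
v2_update(1001, 'efgh')
v2_insert(1002, 'wxyz')
print(db.execute("SELECT oldColumn, newColumn FROM table1 WHERE userId = 1001").fetchone())
# ('efgh', 'efgh')
```

The key point the sketch shows is that V2 never reads `newColumn`; it only keeps it written, so V1 and V2 instances can coexist during the rolling deploy.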
3. Now that all new data is kept in sync, we still have a diff between `oldColumn` and `newColumn` for old rows. To eliminate the difference, we run a background script that copies values missing in `newColumn` from `oldColumn`
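The backfill script in step 3 boils down to one UPDATE; a batched `sqlite3` sketch (the batch size, sample data and table setup are illustrative assumptions) keeps each transaction short so concurrent V2 writes are not blocked for long:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE table1 (userId INTEGER PRIMARY KEY, oldColumn TEXT, newColumn TEXT)")
# Rows written by V1 instances: newColumn is still NULL.
db.executemany("INSERT INTO table1 (userId, oldColumn) VALUES (?, ?)",
               [(i, f"val{i}") for i in range(10)])

BATCH = 3  # illustrative batch size; tune so each transaction stays short

def backfill():
    # Copy oldColumn into newColumn for rows the rolling deploy missed,
    # a small batch per transaction.
    while True:
        with db:  # commits the batch on exit
            cur = db.execute(
                "UPDATE table1 SET newColumn = oldColumn "
                "WHERE userId IN (SELECT userId FROM table1 "
                "WHERE newColumn IS NULL LIMIT ?)", (BATCH,))
        if cur.rowcount == 0:
            break

backfill()
print(db.execute("SELECT COUNT(*) FROM table1 WHERE newColumn IS NULL").fetchone()[0])
# 0
```

Because V2 already mirrors every write, the script only ever touches rows that no V2 instance has updated yet, so running it repeatedly is harmless.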
4. Now that the columns are in sync, gradually deploy V3. The V3 code contains the following changes: SELECTs, UPDATEs, INSERTs and DELETEs go to `newColumn` now; `table1.oldColumn` is not used anymore
5. Drop the unused `table1.oldColumn`: `ALTER TABLE table1 DROP COLUMN oldColumn`
Note: steps 3 and 5 can be performed as part of the database migration during V2 and V3 startup
Recap:

- Initially `newColumn` is empty and all data goes to `oldColumn`
- While we gradually replace V1 with V2, data starts to flow into `newColumn` alongside `oldColumn`. At this point some data still flows into `oldColumn` only (because we are performing a rolling update, so not all instances are V2 yet)
- As soon as V2 is fully deployed, data flows into both `oldColumn` and `newColumn`. We mirror updates and inserts to keep the columns in sync
- However, some data was inserted into `oldColumn` before `newColumn` was devised, and some data got there from the remaining V1 instances that existed during the rolling update. We must get rid of this difference
- When the backfill script is run, data in `oldColumn` that is missing in `newColumn` gets copied there
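Before deploying V3 it is worth confirming the two columns really do match; a minimal consistency-check query, sketched with `sqlite3` (the sample rows are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE table1 (userId INTEGER PRIMARY KEY, oldColumn TEXT, newColumn TEXT)")
db.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
               [(1, 'a', 'a'), (2, 'b', 'b'), (3, 'c', None)])

# Rows where newColumn is missing or has diverged from oldColumn;
# this count must be 0 before V3 goes out.
out_of_sync = db.execute(
    "SELECT COUNT(*) FROM table1 "
    "WHERE newColumn IS NULL OR newColumn <> oldColumn").fetchone()[0]
print(out_of_sync)
# 1  (row 3 still needs the backfill)
```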
Your use of terms is a bit confusing, as what you are describing is not a "migration" as the term is normally used. It is also not clear what requirements you have that need zero downtime. Downtime means making something unavailable for a period of time; you can add or drop a column from a table without making that table unavailable to users, so the change requires zero downtime. But obviously any query that referenced a dropped column will no longer work.
If you want to change your DB structure without breaking anything, then either you need control over everything that accesses the DB (which is unlikely to be possible) so that you can deploy the DB change and everything affected by it in one go, or you can protect users from changes by using views that hide the database implementation from them and only allowing users to access the views.
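The view-based approach can be illustrated like this: the base table already uses the new column name, while a view keeps exposing the old name so existing queries keep working (sketched with `sqlite3` purely for illustration; the table and view names are assumptions, and the DDL syntax varies by RDBMS):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The base table already uses the new column name...
db.execute("CREATE TABLE table1_base (userId INTEGER PRIMARY KEY, newColumn TEXT)")
db.execute("INSERT INTO table1_base VALUES (1001, 'abcd')")
# ...but a view presents it under the old name, so callers that still
# reference oldColumn are unaffected by the rename.
db.execute("CREATE VIEW table1 AS "
           "SELECT userId, newColumn AS oldColumn FROM table1_base")

value = db.execute("SELECT oldColumn FROM table1 WHERE userId = 1001").fetchone()[0]
print(value)
# abcd
```

Note that writes through such a view depend on the RDBMS: some support updatable views directly, others require INSTEAD OF triggers.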
If you make changes that are so fundamental that they cannot be hidden behind a view definition change, then you probably have no choice but to communicate the change to your users, and they will all need to go through a proper SDLC to determine whether the change affects them and to update their code if it does.