Separating schema and data migrations

150 Views Asked by At

We currently have mixed data and schema migrations, that are being run by app's ORM right before the app startup. Data migrations in our case is a type of migration when we manipulate the data, for example moving data around different columns, fixing data format issues, fixing some wrong data or in some rare cases moving data between different services databases. They are normally simple ( a few create/update statements), but there are cases when we leverage programming language with some logic to calculate a final state. What I'm looking for is to make database schema declarative and apply schema migrations whenever someone changes db schema file in the repository. This could be achieved by using some tools like atlasgo.io, where you have a schema file defined and generate versioned schema migrations by calculating a diff between state file and actual database.

So if I apply this approach, our schema and data migrations become separate. I planned to run schema migrations first using schema migration tools and only then data migrations using the same ORM approach. But it looks that it will bring us some troubles. Imagine if we don't deliver code on some environment for some time and there are a few data and schema migrations ready to run. Because I run schema migrations first - there are high chances that data migrations may fail because they could expect some schema that may have already been changed. So the order of schema and data migrations is very important.

Do you have any advice how do you run such things on your projects?

2

There are 2 best solutions below

0
Eugene Shilin On BEST ANSWER

I wanted to sum up and post the answer that worked for me based on the answers and the data that I was able to find in the internet.

First of all - yes, there is no possibility and even need to separate data and schema migrations. But there is a still possibility to benefit having a declarative database schema in your code which you could use to automatically generate schema migrations (alter table or create table statements).

Specifically for me, I decided to keep using the ORM approach as a runner of all kind of migrations (because of the order issue mentioned above, therefore I choose one tool to run them all) and having a schema file in my application code which is being used by a separate tool (atlas as mentioned before) to calculate the diff between current schema and target schema and produce ALTER table statements as a migration whenever developer wants to make an update the schema. And as a last step, I have taught my ORM migration tool to run both types of migrations: produced bare sql schema migrations and data migrations that we already had. Hope this answer will help someone else.

7
Bill Karwin On

Data migrations should be designed for the latest schema revision at the time the data migration is developed.

Of course there is no guarantee that data migration will work with any older or newer revision of the schema. There's nothing you can do about that. It would be nigh impossible to transform any arbitrary data migration after the fact to fit any later revision of the schema (in some cases it might be possible, but not in general).

So this leads to the conclusion that you can't separate data migrations and schema migrations. They must be applied in chronological order. That is, one queue of migrations, both data and schema. That way the data migrations can be applied against the correct revision of the schema.

Later schema migrations may affect data too (e.g. ALTER TABLE...DROP COLUMN, or changes to data types). That must be by design.

It is also typical that your database has one small table to store the current revision. I've seen this table called schema_version or something like that. The name doesn't matter, as long as your migration framework knows to look for that table. If the current schema_version data stores an out of date version, the framework knows which data migrations and schema migrations to apply. These migrations are stored in your code repository.

I've also seen many cases where a developer applied some migration but forgot to update the schema_version. It happens. You can try to use tools and processes to follow in the hope of minimizing it, but humans make mistakes.

If the schema_version is not accurate, then subsequent attempts to apply migrations get confused. It may try to apply migrations that have already been applied, or else skip a migration that needs to be applied.

Someone has to go figure out which step was missed, and fix it up manually.

There's no automatic solution to this, it involves reading the code of your migrations, then examining the current state of the data and schema to see what has and has not been applied, and then the human to develop custom steps to fix what's wrong.

I don't think there's any perfect process that doesn't involve occasional manual work to correct missteps. Much like any other field of engineering!