Migratory movements

If your software project is successful, don’t expect an eternal throne to sit on while a Greek chorus sings odes to your victories. No: the glory of victory is fleeting, and the reward for doing good work is more work.

You built your project, launched it, and gained many users, but one day, you discover that your system does not scale and that you need to replace the database. Of course, you can’t throw away the old data, initialize a new database, and start using it: you want all the old data to still be available after the move.

It seems that you’ll need to perform a migration.

[Illustration: a stork carrying a suitcase with a “South” sticker.]

Successful migrations are all alike, but each failed migration fails in its own way. Therefore, we must study the characteristics of a good migration: it is well-planned, reversible, gradual, and does not interrupt the service.

It is essential to plan the migration thoroughly. Every step in the plan must be determined beforehand and described in enough detail that any engineer could perform the migration by following the documentation. Without a plan, you won’t know how long the migration will take, you will have to improvise, and the whole process will be far more stressful than it needs to be.

Migrations always run into trouble, so the plan must not include only the steps to follow if everything goes well (the happy path). It must also contain instructions to follow if problems turn up. It is often possible to fix the problem in the moment and continue, but sometimes, it will be necessary to stop the migration while we figure out what to do next.

This takes me to the second point: the migration process must be reversible. Imagine you are making a complex change in the service, run into a problem, and want to put everything back the original way while you investigate. If the changes are not reversible, you won’t be able to undo them, and then you will have two problems instead of only one.
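One way to picture reversibility is to pair every step with its own undo action. The following is only a minimal sketch under that assumption: the names `Step` and `run_migration` are made up for illustration, and a real migration would also log what it does and decide case by case whether to roll back or to pause.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """A hypothetical migration step: one action and the action that reverts it."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def run_migration(steps: List[Step]) -> bool:
    """Apply steps in order. If one fails, undo the completed ones in reverse order.

    Returns True if every step was applied, False if the migration was rolled back.
    """
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception:
            # Put everything back the original way, newest change first.
            for finished in reversed(done):
                finished.undo()
            return False
        done.append(step)
    return True
```

If the third of three steps raises an exception, the first two are undone in reverse order and the system is back where it started, which leaves you with one problem to investigate instead of two.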

Some changes are, by nature, irreversible (for example, removing a column from a database table). The best thing you can do about those changes is to leave them for last, when every other step in the migration is already done and tested. If you cannot do that, you should prepare your system to run with that change done only halfway. For example, imagine you have many replicas, and half are migrated while the rest aren’t; it’s much better if your system can keep using all of them, migrated or not, instead of only half.

Even in the absence of trouble, it’s much better to perform a gradual migration, little by little, than to do it all at once. The main benefit is that you can observe the system’s behavior at every step of the migration and detect problems before they become too big. In the above example, we could migrate one replica daily while we monitor them all to check that they work well and can withstand the workload.
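The replica example could be sketched roughly as follows. Everything here is an assumption made for illustration: `migrate` and `is_healthy` stand in for whatever your own tooling does, and a real rollout would wait (a day, in the example above) between one replica and the next.

```python
from typing import Callable, Iterable, List, Tuple


def migrate_gradually(
    replicas: Iterable[str],
    migrate: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Migrate one replica at a time, checking all migrated replicas after each step.

    Returns the replicas migrated so far and whether they were all healthy
    at the last check. Stops at the first failed check.
    """
    migrated: List[str] = []
    for replica in replicas:
        migrate(replica)
        migrated.append(replica)
        # Monitor every replica migrated so far, not only the latest one.
        if not all(is_healthy(r) for r in migrated):
            return migrated, False
        # A real rollout would pause here before touching the next replica.
    return migrated, True
```

Because the loop stops at the first failed check, a problem affects at most the replicas migrated so far, and the remaining ones keep running the old version while you investigate.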

Another significant benefit of a gradual migration is the ability to perform it without interrupting the service. I have seen too many web pages saying, “This website will be closed for maintenance over the next week.” Migrations always take longer than you expect, so if the service must be closed while they happen, we will be in trouble if they take too long.

So how can you replace your system’s database? Obviously not by closing the system, copying the data, changing the configuration, and reopening it, because I just wrote almost 600 words advising against it.

You should run two databases side by side, reading from and writing to both of them, while a background process copies the old data from the old database to the new one. Eventually, all the data will be migrated without a break in service, and after checking that both databases contain the same information, you can shut down the old one and declare your migration complete.
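To make the idea concrete, here is a toy sketch that uses two in-memory dictionaries as stand-ins for the old and new databases. The class `DualWriteStore` and its methods are hypothetical names, and the sketch rests on simplifying assumptions: it ignores deletes, concurrency, and partial failures, all of which a real migration has to handle.

```python
from typing import Dict, Optional


class DualWriteStore:
    """Writes go to both stores; reads prefer the new store and fall back to the old one."""

    def __init__(self, old: Dict[str, str], new: Dict[str, str]) -> None:
        self.old = old
        self.new = new

    def write(self, key: str, value: str) -> None:
        # Keep both databases up to date for all new traffic.
        self.old[key] = value
        self.new[key] = value

    def read(self, key: str) -> Optional[str]:
        # Data not yet copied by the backfill is still served from the old store.
        if key in self.new:
            return self.new[key]
        return self.old.get(key)

    def backfill(self) -> int:
        """Copy keys that exist only in the old store; never overwrite newer writes."""
        copied = 0
        for key, value in list(self.old.items()):
            if key not in self.new:
                self.new[key] = value
                copied += 1
        return copied

    def in_sync(self) -> bool:
        """Final check before retiring the old store: both hold the same data."""
        return self.old == self.new
```

The backfill only copies keys that the new store does not have yet, so it can run in the background, or be stopped and restarted, without clobbering data written while the migration is in progress. Once `in_sync()` returns true, the old store is a candidate for retirement.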