Databases in Production · 7 of 8

Zero-downtime migrations

The patterns that keep ALTER TABLE off the critical path: add column with default in two steps, add index concurrently, rename via shadow column. What strong_migrations enforces and the table-size reflex that separates safe deploys from 9 a.m. incidents.

Where this rule comes from

DDL statements in Postgres (ALTER TABLE, CREATE INDEX, DROP COLUMN) take locks on the tables they modify. The default lock for most DDL is AccessExclusiveLock, which blocks every other operation on the table, including reads. For a small table this lock is acquired and released in milliseconds. For a 50-million-row table, the operation under the lock can take ten minutes, during which every query that touches the table queues behind the lock and the connection pool fills with stuck sessions.
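
A common defensive measure is to set lock_timeout before the DDL, so a blocked ALTER TABLE aborts quickly instead of queueing the whole table's traffic behind it. A minimal sketch; the five-second value and the column are assumptions to adapt:

class AddNotesToOrders < ActiveRecord::Migration[7.1]
  def up
    # Abort the DDL if the lock is not granted within 5 seconds, instead of
    # queueing every other query on the table behind the lock request.
    execute "SET lock_timeout = '5s'"
    add_column :orders, :notes, :text  # hypothetical column, for illustration
  end

  def down
    remove_column :orders, :notes
  end
end

A migration that trips the timeout fails cleanly and can be rerun at a quieter moment; without it, one long-running query holding a conflicting lock can turn the deploy into an outage.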

The history of Postgres DDL has been a slow march toward shorter locks. Postgres 11 made adding a column with a constant default a metadata-only operation (no row rewrite). Postgres 12 let SET NOT NULL skip its full-table scan when a validated CHECK constraint already proves the column is non-null. CREATE INDEX CONCURRENTLY has existed since 8.2 for building indexes without locking writes. The catalog of "safe DDL" has grown, but the trap remains: a single naive ALTER TABLE in a migration file can lock a production table for the entire deploy window.

Zero-downtime migration patterns are the catalog of "do this DDL the right way." The senior reflex on any migration touching a large table is to check the table size before reading the migration, then map the proposed change onto a known safe pattern. Most migrations have a safe form; the ones that do not need to be split into multiple steps across deploys.

The anti-pattern: the unread Rails migration

A team ships a migration without thinking about table size:

class AddStatusToOrders < ActiveRecord::Migration[7.1]
  def change
    add_column :orders, :status, :string, default: 'pending', null: false
  end
end

On a 10,000-row dev database, the migration runs instantly. On a 50-million-row production orders table, the migration acquires AccessExclusiveLock on the table at 09:01, starts rewriting every row to populate the default value, and finishes at 09:14. For 13 minutes, every read and write of the orders table waits. The site is down for everything order-related.

Postgres 11 fixed this specific case: ADD COLUMN ... DEFAULT ... NOT NULL with a constant default is now metadata-only. The default is stored separately and applied at read time for existing rows. The migration above is safe on PG11+ at any table size.

But other migrations still trigger row rewrites or long locks. Changing a column type. Setting NOT NULL on an existing nullable column. Adding a CHECK constraint without NOT VALID. Adding a foreign key without NOT VALID. The team that writes the migration usually does not see these on their dev database, because dev tables are tiny. The senior reflex: open the production table size before reading the migration body. The CHECK-constraint case has a standard safe form, sketched below.
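
A minimal sketch of that two-step form, assuming a hypothetical orders.total column and Rails 6.1+ (which added add_check_constraint and validate_check_constraint):

# Migration 1: declare the constraint without scanning existing rows
class AddTotalCheckToOrders < ActiveRecord::Migration[7.1]
  def change
    # NOT VALID: new writes are checked immediately, existing rows are
    # not scanned, so the lock is brief at any table size.
    add_check_constraint :orders, "total >= 0",
                         name: "orders_total_nonnegative", validate: false
  end
end

# Migration 2 (separate deploy or separate transaction): validate it
class ValidateTotalCheckOnOrders < ActiveRecord::Migration[7.1]
  def change
    # The validation scans the table but takes only a weak lock
    # (ShareUpdateExclusiveLock), so reads and writes continue.
    validate_check_constraint :orders, name: "orders_total_nonnegative"
  end
end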

Adding columns safely

On Postgres 11+:

  • ADD COLUMN with no default or with a constant default: safe at any size. Metadata-only.
  • ADD COLUMN with a volatile default (DEFAULT random(), DEFAULT clock_timestamp()): rewrites every row. Unsafe at large size. (NOW() is stable, not volatile, so it takes the metadata-only path; existing rows get the timestamp evaluated once at ALTER time.)
  • ADD COLUMN with NOT NULL and no default: fails on tables with rows. Set the default first, then add NOT NULL after backfill.

The standard pattern for "add a column that should be NOT NULL with values for existing rows":

# Deploy 1: add the column nullable, backfill in batches
class AddStatusToOrders < ActiveRecord::Migration[7.1]
  def change
    add_column :orders, :status, :string, default: 'pending'
  end
end

# Background job or rake task, after deploy 1
Order.in_batches(of: 10_000) do |batch|
  batch.where(status: nil).update_all(status: 'pending')
end

# Deploy 2: add NOT NULL constraint after backfill is done
class AddStatusNotNullToOrders < ActiveRecord::Migration[7.1]
  def change
    change_column_null :orders, :status, false
  end
end

The reason for splitting: change_column_null still takes an AccessExclusiveLock and scans the table to verify no NULLs remain, so on a huge table it is not free, but it is far shorter than a full row rewrite, and the backfill guarantees it succeeds on the first attempt. On Postgres 12+, there is a fully lock-friendly form: add a CHECK (status IS NOT NULL) constraint with NOT VALID, validate it separately under a weak lock, then set NOT NULL, which skips the scan because the validated constraint already proves it. A sketch follows.
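
The Postgres 12+ form, sketched; it assumes Rails 6.1+ for the check-constraint helpers, and the constraint name is arbitrary:

# Deploy 2, the lock-friendly variant on Postgres 12+ (forward-only)
class AddStatusNotNullToOrders < ActiveRecord::Migration[7.1]
  # Each statement must commit on its own; inside one transaction, the
  # brief lock from the first step would be held until validation finished.
  disable_ddl_transaction!

  def up
    # Brief lock: existing rows are not scanned.
    add_check_constraint :orders, "status IS NOT NULL",
                         name: "orders_status_not_null", validate: false
    # Full scan, but under a weak lock that lets reads and writes continue.
    validate_check_constraint :orders, name: "orders_status_not_null"
    # Postgres 12+ sees the validated constraint and skips its own scan.
    change_column_null :orders, :status, false
    # The CHECK constraint is now redundant.
    remove_check_constraint :orders, name: "orders_status_not_null"
  end
end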

Adding indexes concurrently

CREATE INDEX takes a SHARE lock on the table: reads continue, writes block. On a 50M-row table, the writes block for 5 to 30 minutes. CREATE INDEX CONCURRENTLY takes a weaker lock (ShareUpdateExclusiveLock) and lets writes continue while it builds the index in two table scans. It is slower overall (two scans, plus waits for concurrent transactions to finish) but does not block traffic.

class AddIndexOnOrdersUserId < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def change
    add_index :orders, :user_id, algorithm: :concurrently
  end
end

Two caveats. disable_ddl_transaction! is required because CONCURRENTLY cannot run inside a transaction. And if the index build fails partway through, it leaves behind an INVALID index that has to be dropped manually before retrying; the sketch below shows the cleanup-and-retry shape.
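
A sketch of that shape, assuming the default Rails index name; if_exists keeps it idempotent (Rails 6.1+):

class RetryIndexOnOrdersUserId < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def up
    # Drop the half-built INVALID index left by a failed attempt, if any,
    # then rebuild it concurrently.
    remove_index :orders, :user_id, algorithm: :concurrently, if_exists: true
    add_index :orders, :user_id, algorithm: :concurrently
  end

  def down
    remove_index :orders, :user_id, algorithm: :concurrently, if_exists: true
  end
end

Half-built indexes show up in pg_index with indisvalid = false, and psql's \d output marks them INVALID.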

The "always concurrently in production" rule has one exception: on a tiny table during development, the synchronous CREATE INDEX is fine and easier to debug. For any production table over 100,000 rows, default to concurrently.

Renaming a column via shadow column

ALTER TABLE ... RENAME COLUMN is safe at the database level (metadata-only on Postgres). The trap is at the application level. Between the moment the migration runs and the moment new code referencing the new name is deployed, requests served by old code that reference the old name break. The deploy window becomes a multi-minute period of broken requests.

The safe pattern is a shadow column across three deploys:

  1. Deploy 1: add the new column, copy values from the old one. Update application code to write both old and new on every change, but keep reading from old (sketched after this list).
  2. Deploy 2: switch application reads to the new column. Old column is still being written to (in case of rollback).
  3. Deploy 3: stop writing the old column. Drop it in a follow-up migration.
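
A deploy-1 sketch for a hypothetical rename of orders.state to orders.status (all names are illustrative):

# Migration: add the shadow column (metadata-only, safe at any size)
class AddStatusShadowToOrders < ActiveRecord::Migration[7.1]
  def change
    add_column :orders, :status, :string
  end
end

# Model: dual-write so old and new stay in sync; reads still use state
class Order < ApplicationRecord
  before_save :copy_state_to_status

  private

  def copy_state_to_status
    self.status = state
  end
end

A batched backfill (same shape as the earlier one) copies historical rows. Note that update_all and raw SQL bypass callbacks; if the codebase writes this table that way, the dual-write has to happen at those call sites too, or in a database trigger.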

The shadow-column pattern is the same shape for renames, column-type changes, and splitting a column into two. The principle is that any database-level change which takes effect instantly must not require a coordinated change in application code that has not been deployed yet. Decouple the database change from the code change by giving both old and new the data they need at the same time.

Dropping a column safely

Dropping a column is a metadata-only operation: the column header is removed from the table catalog, but the existing data stays on disk until the next VACUUM FULL or pg_repack. The lock is brief.

The danger is not in the migration; it is in Active Record. AR caches schema metadata at boot. If a Rails process has cached the column's existence and the migration drops it, the next time AR tries to use that column the query fails. The pattern that avoids the bug:

  1. Deploy 1: tell Active Record to ignore the column via self.ignored_columns on the model (sketched after this list). Confirm production is happy with the column ignored.
  2. Deploy 2: run the migration that drops the column. Old AR processes serving requests have the column ignored, so they do not try to use it.
  3. Deploy 3 (optional): remove the ignored_columns line.
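
A sketch of deploys 1 and 2, assuming a hypothetical orders.legacy_status column:

# Deploy 1: the model stops seeing the column before anything drops it
class Order < ApplicationRecord
  self.ignored_columns += ["legacy_status"]
end

# Deploy 2: the migration drops the column
class RemoveLegacyStatusFromOrders < ActiveRecord::Migration[7.1]
  def change
    remove_column :orders, :legacy_status, :string
  end
end

Including the type (:string) in remove_column keeps the migration reversible.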

Skipping the ignore step and dropping the column in one deploy works most of the time. When it does not, users see a brief window of "column does not exist" errors.

Backfilling large tables

A migration that runs User.where(verified: nil).update_all(verified: false) on a 10-million-row table updates every row in one transaction. The transaction holds row locks on every row it touches, generates 10 million dead tuples in one burst (lesson 5), and forces replicas to replay one giant transaction, spiking replication lag. The right approach is batching:

class BackfillUserVerifiedFlag < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!

  def up
    User.unscoped.in_batches(of: 5_000) do |batch|
      batch.where(verified: nil).update_all(verified: false)
      sleep 0.1  # Give autovacuum and replicas room to keep up
    end
  end

  def down
    # Reversal is usually a no-op for a backfill
  end
end

Each batch is its own transaction. Locks are short, dead tuples are spread across the run instead of concentrated, autovacuum can keep up. The optional sleep is for very busy systems where the backfill itself competes with live traffic for resources. For backfills that take more than a few minutes, run them outside the migration entirely (as a Rake task or a one-off Sidekiq job) so a failure does not block the deploy.
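
A minimal sketch of the out-of-band form; the namespace and task name are arbitrary:

# lib/tasks/backfills.rake
namespace :backfills do
  desc "Backfill users.verified in batches"
  task user_verified: :environment do
    User.unscoped.in_batches(of: 5_000) do |batch|
      batch.where(verified: nil).update_all(verified: false)
    end
  end
end

Because the where(verified: nil) filter skips rows already done, the task is idempotent: if it dies halfway through, rerunning it picks up where it left off.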

strong_migrations as the safety net

The strong_migrations gem reads migration files and blocks ones that match known unsafe patterns. Adding a column with a volatile default. Creating an index without CONCURRENTLY on a table over a threshold. Changing a column type. The list is long and well-maintained.

The gem does not catch every issue (large backfills inside migrations are usually allowed because the gem cannot judge the data size), but it catches the most common foot-guns. Adding it to a Rails app takes one line in the Gemfile and one initializer. For any team shipping migrations more than once a week, strong_migrations is worth the install cost on day one.
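
The install, sketched; the start_after timestamp is a placeholder to replace with your newest existing migration's version so only future migrations are checked:

# Gemfile
gem "strong_migrations"

# config/initializers/strong_migrations.rb
StrongMigrations.start_after = 20240101000000  # placeholder: your latest migration version
StrongMigrations.lock_timeout = 10.seconds     # fail fast instead of queueing traffic
StrongMigrations.statement_timeout = 1.hour    # cap any single runaway statement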

The principle at play

Most production database incidents are not novel; they are well-catalogued mistakes hitting a new team that does not know the catalog. The patterns in this lesson exist because thousands of teams have already learned them the expensive way. The senior version of the workflow is short:

  1. Open production table size before reading the migration.
  2. If the table is large, check whether the migration matches a known safe pattern. If not, redesign across multiple deploys.
  3. For indexes on large tables: algorithm: :concurrently.
  4. For backfills on large tables: out-of-band rake task in batches.
  5. For renames and drops: shadow column or ignored_columns across deploys.

Every one of these is documented in the strong_migrations README and reinforced by anyone who has spent a 9 a.m. incident on the receiving end of an unsafe migration. The senior reflex is making the checks automatic instead of remembered.

Practice exercise

  1. Install strong_migrations in your app. Run the existing migration suite against it to see what would have been caught.
  2. Pick the largest table in your app. Run SELECT pg_size_pretty(pg_total_relation_size('table_name')), reltuples::bigint FROM pg_class WHERE relname = 'table_name';. Memorise the number. The next migration that touches it should be evaluated against this size, not against your dev database's.
  3. Find a recent migration that added a column with a default. Was it on PG11+? Was the default constant? If both yes, it was safe. If not, work out what the safe shape would have been.
  4. Pick a column you would rename if you could. Sketch the three-deploy shadow-column plan. Estimate how many days it takes to ship end-to-end. If the answer is "more than a week," the rename is more expensive than living with the imperfect name; that itself is a senior tradeoff worth recognising.
