Why every deploy must be boring
Users notice two things: new features that delight them, and downtime that angers them. A migration that locks a table for ninety seconds at 03:00 still triggers alerts, support tickets, and churn. The goal is not merely to migrate during low traffic; the goal is to migrate so that users cannot perceive the change at all. This playbook shows you how.
The physics of a blocking migration
Relational engines stay consistent by taking metadata locks. When you add a NOT NULL column or rewrite a primary key, PostgreSQL takes an ACCESS EXCLUSIVE lock (MySQL an exclusive metadata lock) that queues every subsequent query on that table. On a hot table, the queue explodes, connections saturate, and the application wedges. The problem is mechanical, not mystical.
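You can watch that queue form in PostgreSQL by asking the catalog which sessions are stuck waiting on locks; a minimal diagnostic, using only the built-in pg_locks and pg_stat_activity views, looks like this:
-- Sessions waiting on a lock, alongside the statement they are trying to run
SELECT a.pid,
       a.query,
       l.mode,
       pg_blocking_pids(a.pid) AS blocked_by
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
If this list grows during a deploy, the migration is the culprit far more often than the application.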
Pick your flavor: online, rolling, or blue-green
Online DDL with built-in tools
PostgreSQL 11+ brings ALTER TABLE ... ADD COLUMN ... DEFAULT ... NOT NULL that no longer rewrites the heap (for non-volatile defaults). MySQL 8.0 offers ALGORITHM=INPLACE for many operations. These are the cheapest wins: test them first in a staging environment sized like production.
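A sketch of both flavors (table and column names are made up for illustration):
-- PostgreSQL 11+: a constant default is stored as metadata, no heap rewrite
ALTER TABLE users ADD COLUMN is_verified BOOLEAN NOT NULL DEFAULT false;
-- MySQL 8.0: ask for an in-place, non-locking build and fail fast if the engine cannot honor it
ALTER TABLE users ADD INDEX idx_users_created_at (created_at), ALGORITHM=INPLACE, LOCK=NONE;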
Rolling (expand-contract)
Code and schema evolve in lock-step across three phases:
- expand: add new objects (column, table, index) without touching old ones
- dual-write: application writes to both old and new spots, reads still use old
- contract: once data is identical, cut reads to new, drop old
This pattern runs with ordinary migrations plus feature flags, so rollbacks stay one-click.
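A minimal sketch of the three phases, using a hypothetical orders table that is moving from a naive timestamp to a UTC one:
-- expand: new column, nullable, no default, so the ALTER is metadata-only
ALTER TABLE orders ADD COLUMN shipped_at_utc TIMESTAMPTZ;
-- dual-write: the application now writes both columns; back-fill old rows in small batches
UPDATE orders
SET shipped_at_utc = shipped_at AT TIME ZONE 'UTC'
WHERE shipped_at_utc IS NULL
  AND id BETWEEN 1 AND 10000;
-- contract: after reads have been cut over behind the feature flag, drop the old column
ALTER TABLE orders DROP COLUMN shipped_at;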
Blue-green cut-over
Spin up an entire replica stack, migrate the idle clone, then swap user traffic at the load balancer. Costly but bullet-proof for financial or medical systems where rollback windows are near zero.
Laying the groundwork
Always snapshot first
Even with online tools, take a logical backup or use your cloud vendor’s point-in-time copy. Store the snapshot name in the ticket that tracks the deploy; confusion kills more DR plans than bad code.
Measure the blast radius
Run pg_stat_statements or the MySQL slow log for twenty-four hours. Identify tables that receive more than 10 % of total queries. These are your high-risk targets; every change there needs extra scrutiny and a canary release.
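On the PostgreSQL side, a rough first pass over the statistics might look like this (assuming the pg_stat_statements extension is installed):
-- Statements ranked by their share of total calls over the sampling window
SELECT query,
       calls,
       round(100.0 * calls / sum(calls) OVER (), 1) AS pct_of_calls
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;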
Pre-migration health check
Before you merge the pull request, assert:
- replication lag < 2 s on all standbys
- free disk > 30 % on every node
- no long-running transactions older than the last checkpoint
Fail the CI pipeline if any check fails; humans override too slowly under pressure.
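A sketch of the PostgreSQL half of those checks (the free-disk assertion lives outside the database, and the five-minute transaction threshold below is an arbitrary stand-in for "older than the last checkpoint"):
-- Standbys lagging more than the 2 s budget (run on the primary)
SELECT application_name, replay_lag
FROM pg_stat_replication
WHERE replay_lag > interval '2 seconds';
-- Transactions that have been open suspiciously long
SELECT pid, now() - xact_start AS xact_age, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes';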
PostgreSQL tactics that work today
Add column without default
This is instantaneous: existing rows are left untouched and simply read as NULL until their next update. If you need a default and cannot rely on the fast-default path above (older PostgreSQL, or a volatile default), split it:
ALTER TABLE users ADD COLUMN newsletter BOOLEAN;              -- sub-ms, metadata only
ALTER TABLE users ALTER COLUMN newsletter SET DEFAULT false;  -- still sub-ms
UPDATE users SET newsletter = false WHERE newsletter IS NULL; -- optional back-fill, run it in batches
Create index concurrently
CREATE INDEX CONCURRENTLY idx_user_email ON users(email);
skips the write-blocking lock of a plain CREATE INDEX but builds the index in two table scans, so it is slower and must wait out existing transactions. Monitor for write amplification, and know the failure mode: if the build errors out (a deadlock, a unique violation), it leaves an INVALID index behind that you must drop and recreate; it does not restart automatically.
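A failed concurrent build is easy to spot and retry; something along these lines, reusing the index name from the example above:
-- Any invalid leftovers from failed concurrent builds
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;
-- Drop the half-built index and try again
DROP INDEX CONCURRENTLY IF EXISTS idx_user_email;
CREATE INDEX CONCURRENTLY idx_user_email ON users(email);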
Rewrite the primary key safely
Need UUIDs? Create a new table, sync it with logical replication or pglogical, then swap via rename inside a single transaction once the rows match. The rename takes microseconds; users never leave the site.
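The cut-over itself stays tiny; a sketch of the swap, assuming the rebuilt table is called users_uuid:
-- Metadata-only swap once row counts (and a checksum) agree
BEGIN;
LOCK TABLE users IN ACCESS EXCLUSIVE MODE;  -- held only for the duration of two renames
ALTER TABLE users RENAME TO users_legacy;
ALTER TABLE users_uuid RENAME TO users;
COMMIT;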
MySQL tactics that work today
Instant DDL checklist
MySQL 8.0 applies a growing list of operations instantly with ALGORITHM=INSTANT; as of 8.0.29 that list includes (request the algorithm explicitly, as shown after this list):
- add column (any position since 8.0.29; last position only before that)
- drop column (if it is not part of a foreign key or index)
- appending a value to an ENUM
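Stating the algorithm explicitly is a cheap safety net: MySQL raises an error instead of silently falling back to a table copy. A hypothetical example:
-- Fails immediately if the change cannot be applied instantly
ALTER TABLE orders ADD COLUMN refund_reason VARCHAR(255), ALGORITHM=INSTANT;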
For anything else, use pt-online-schema-change or GitHub's gh-ost. Both tools:
- create shadow table like the original
- install triggers to stream new rows
- copy chunks while throttling on replica lag
- atomically rename
Throttling done right
Hook into your replica lag metric; pause when lag > 3 s. Also watch Threads_running; if it spikes above 1.5 × the core count, back off. These two knobs prevent the thundering-herd pile-up that kills migrations.
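Both signals are one query away on MySQL; a throttler can poll something like this between chunks:
-- Concurrency pressure on the primary
SHOW GLOBAL STATUS LIKE 'Threads_running';
-- Run on each replica: inspect Seconds_Behind_Source in the output (MySQL 8.0.22+)
SHOW REPLICA STATUS;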
Migration scripts as code
Checked-in SQL is not enough. Wrap every change in a small runner script that:
- opens a transaction only when safe; online tools do not
- writes start time and git sha to a migration log table
- catches any exception and emits a rollback plan
- records finish time and row counts
Ops teams love a single log table more than a Slack novel.
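A minimal sketch of such a log table, with a hypothetical schema (adjust the types for MySQL):
CREATE TABLE IF NOT EXISTS migration_log (
  id           BIGSERIAL PRIMARY KEY,
  migration    TEXT        NOT NULL,             -- file name or identifier
  git_sha      TEXT        NOT NULL,
  started_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  finished_at  TIMESTAMPTZ,
  rows_touched BIGINT,
  outcome      TEXT                              -- 'success', 'rolled_back', ...
);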
Testing without a production-sized staging copy
Full prod clones are pricey. Two cheaper layers catch 90 % of surprises:
Shadow traffic replay
Use tcpdump to capture a five-minute slice of production queries, anonymize them, then replay against a containerized copy at 2× speed. This reveals lock conflicts without real users.
Synthetic chaos
Run a background job that updates the hottest table continuously while your migration script executes. A simple bash loop with the psql or mysql client is enough to surface timing issues you will otherwise only see at scale.
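If you would rather stay inside the database than script the loop in bash, a rough PL/pgSQL equivalent (assuming a users table with an integer id and a last_seen_at column) might look like:
-- Continuous single-row updates; run in a second session while the migration executes
DO $$
BEGIN
  FOR i IN 1..100000 LOOP
    UPDATE users
    SET last_seen_at = now()
    WHERE id = 1 + floor(random() * 1000000)::int;
    COMMIT;  -- PostgreSQL 11+ allows transaction control in a top-level DO block
  END LOOP;
END $$;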
Rollback is a feature
If you cannot roll back in under five minutes, you do not have a release—you have a hope. Design every change backward first:
- keep the old column unread for one release cycle
- add compensating code that migrates data back on downgrade
- store a copy of the old value in a JSONB staging column when data mutates
Test the rollback weekly in staging; muscle memory matters during a 04:00 page.
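The staging-column idea can be as blunt as this sketch (a hypothetical pricing migration on a users table):
-- Keep the old value alongside the mutation so a downgrade can restore it
ALTER TABLE users ADD COLUMN rollback_stash JSONB;
UPDATE users
SET rollback_stash = jsonb_build_object('plan', plan),
    plan           = 'new_pricing'
WHERE plan = 'legacy_pricing';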
Monitoring the cut-over
Dashboards must answer three questions:
- are writes succeeding?
- are reads returning correct data?
- is replication still healthy?
Put the graphs on a wall-mounted screen during deploy. Silence is golden; any spike triggers an immediate pause, not a heroic push forward.
Common traps and how to dodge them
Foreign key validation storms
Adding a foreign key validates every existing row while holding a heavy lock. Split it: add the constraint NOT VALID in one release, then run VALIDATE CONSTRAINT later. Validation only needs a SHARE UPDATE EXCLUSIVE lock, so normal reads and writes keep flowing while the existing rows are checked.
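In PostgreSQL the two releases look roughly like this (names are illustrative):
-- Release 1: register the constraint without scanning existing rows
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_user
  FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
-- Release 2, off-peak: check existing rows under a lock that lets traffic continue
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user;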
Default expressions that call now()
In PostgreSQL, now() is fixed at transaction start, so a million-row back-fill stamps every row with the same timestamp and quietly defeats audit trails. Pre-compute the value in application code, or use clock_timestamp() when you genuinely need per-row wall-clock time (generated columns will not help here: they only accept immutable expressions).
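When per-row timestamps actually matter, clock_timestamp() is the tool; a sketch against a hypothetical audit table:
-- now() is frozen at transaction start; clock_timestamp() advances with every call
UPDATE audit_events
SET touched_at = clock_timestamp()   -- per-row wall-clock time
WHERE touched_at IS NULL;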
Unique indexes on low-cardinality flags
An index on a boolean column that is mostly false adds little selectivity yet still has to be maintained on every write and still participates in the locking you are trying to avoid. Replace it with a partial index (WHERE active = true) or fold the flag into a status enum.
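A partial index keeps only the interesting rows, and it can still be built online; for example (names are illustrative):
-- Indexes only active users; inactive rows cost nothing here
CREATE INDEX CONCURRENTLY idx_users_active_email
  ON users (email)
  WHERE active = true;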
Tooling cheat sheet
- Liquibase: Java-friendly, supports preconditions, rollback auto-generation
- Flyway: lighter, SQL-first, integrates with every CI platform
- sqitch: plan-based with explicit deploy/revert/verify scripts for every change, strong PostgreSQL heritage
- gh-ost: battle-tested for MySQL, throttles on replica lag out of the box
- Ansible's community.postgresql collection: idempotent modules, including concurrent (online) index creation
Putting it all together: sample rollout
- Ticket created three days before release with rollback plan
- Feature flag created in LaunchDarkly, default off
- Migrations merged to main, CI runs replay tests
- Deploy to canary region (5 % traffic) at 14:00 local
- Dashboard green for two hours, promote to 100 % by 16:00
- Next morning, feature flag cleaned up, old column marked deprecated
- Drop migration scheduled for two releases later, after backups
Nobody pages the on-call, product managers smile, investors stay calm.
Key takeaways
Zero downtime is not a single tool—it is a mindset: expand before you contract, test rollbacks before you need them, and never trust a lock you did not measure. Start with the smallest table, practice the pattern, then scale the habit. Your users will never notice, and that is the highest praise a database can receive.
Disclaimer: This article is for educational purposes only and does not constitute production advice. Always test in an environment that mirrors your regulatory and performance constraints. The content was generated by an AI language model and should be verified against official documentation.