Lumen Data · Analytics · Bangalore

A Postgres migration that stopped paging the on-call.

Moving Lumen's 4TB analytics Postgres instance to RDS with zero downtime — and leaving behind a migration workflow the team still uses today.

Duration: 4 weeks
Team: 2 engineers
Shipped: 2024
Stack: Postgres · Go · Terraform
Outcome

The numbers, six months after cutover.

0s
Downtime during cutover. No one on the paging rotation was woken.

3×
Faster analytical queries post-migration; IOPS was the bottleneck.

sqlx-migrate
The migration toolchain, extracted and now open source for anyone.

01 · The context

Lumen runs an analytics platform for e-commerce companies. Their Postgres cluster was 4TB and growing fast on a self-managed VM — doing roughly 11,000 queries per second at peak. The on-call engineer had been paged six times in the prior quarter for disk pressure alone.

The answer was obvious: move to managed Postgres on RDS. The hard part was doing it without a maintenance window. Lumen's customers run dashboards 24/7 from three continents. A minute of downtime is a ticket; an hour is a reputation problem.

The real problem
Most "zero-downtime" guides assume a well-behaved schema. Lumen's had twelve years of history in it, including three tables with circular foreign keys we had to unwind before anything else could move.

02 · The approach

We ran a two-week schema review before we touched any infrastructure. That produced a prioritized list of thirty-one schema cleanups, each of which could ship independently. By the time we were ready to cut over, the database was already healthier.

The cutover itself used logical replication from the old cluster to RDS, a carefully scripted sequence to promote the replica, and a DNS swap. We wrote the promotion runbook as code — every step was an idempotent Go function with its own test — and rehearsed the whole thing four times in staging.

"They refused to do the cutover until we'd done four full-scale dry runs. I understood why halfway through the third one."

03 · The architecture

The end state: a managed RDS instance, logical replication decommissioned, and every schema change expressible as a reversible migration with a round-trip guarantee across dev, test, and prod. We extracted the tooling into sqlx-migrate, now an OSS package.

# Every migration round-trips cleanly.
$ sqlx-migrate up    # apply pending
$ sqlx-migrate down  # revert last
$ sqlx-migrate lint  # check every .up has a matching .down
$ sqlx-migrate test  # round-trip in a disposable db

Lumen still uses the same workflow today, fifteen months after handoff. The paging numbers tell the simplest story: zero Postgres-related pages in the six months after cutover, down from an average of two per week before.

04 · The outcome

Cutover happened on a Thursday at 14:00 local time — on purpose. We wanted engineers awake, dashboards lit, and customer support on standby. The swap took 42 seconds of read-only mode. No customer noticed.

Six months later, Lumen's p95 query latency was down 3×, they hadn't been paged about Postgres once, and the migration workflow we'd built had become the default way the team shipped schema changes. We consider that the real outcome — not the cutover itself.
