01 · The context
Vellum is a clinical admin platform for Nordic primary care groups. Every month, they process about 180,000 insurance claims on behalf of their customers. When we were introduced, about 9% of those claims were failing on first pass, not because of rejections from insurers but because of race conditions and lost messages inside Vellum's own pipeline.
The existing system was a reasonable first version: a Rails monolith, a few background workers, and about a dozen cron jobs held together with good intentions and a Notion page called "Deploy Sequence". It had worked for a long time. It wasn't going to survive the next 3× in volume.
02 · The approach
We ran a one-week discovery on-site in Stockholm. The output was a 14-page memo with a clear recommendation: replace the pipeline with a small, typed Go service built around append-only claim events and a stateless worker pool. Keep Rails for everything else.
Everyone agreed that rewrites are usually a mistake. This wasn't a rewrite — it was a carve-out. We'd own the claim lifecycle, expose an HTTP API for the monolith to call, and leave every other system alone. The new service went into production behind a feature flag on day 15.
03 · The architecture
The pipeline is four stages: ingest, normalize, submit, reconcile. Each stage reads from and writes to Postgres, emits events via NATS, and is idempotent on a claim ID. We used go-pipeline (extracted from this project and now OSS) for fan-out and backpressure.
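To make "idempotent on a claim ID" concrete, here is a minimal sketch in Go. It is not Vellum's code and does not use go-pipeline's API. The `Log`, `Event`, and `Append` names are hypothetical, and an in-memory slice stands in for the Postgres event table. The idea it illustrates is that a stage completion is recorded at most once per claim, so a retried or redelivered message cannot add a second event.

```go
package main

import (
	"fmt"
	"sync"
)

// Stage names follow the four stages described above.
type Stage string

const (
	Ingest    Stage = "ingest"
	Normalize Stage = "normalize"
	Submit    Stage = "submit"
	Reconcile Stage = "reconcile"
)

// Event is one append-only record in a claim's lifecycle (hypothetical shape).
type Event struct {
	ClaimID string
	Stage   Stage
}

// Log is an in-memory stand-in for a Postgres-backed event table.
// Events are only ever appended, never updated or deleted.
type Log struct {
	mu     sync.Mutex
	events []Event
	seen   map[string]bool // key: claim ID + "/" + stage
}

func NewLog() *Log {
	return &Log{seen: make(map[string]bool)}
}

// Append records that a stage completed for a claim. It returns false and
// writes nothing if that (claim, stage) pair was already recorded, which is
// what makes a retried stage safe to run again.
func (l *Log) Append(claimID string, s Stage) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	key := claimID + "/" + string(s)
	if l.seen[key] {
		return false
	}
	l.seen[key] = true
	l.events = append(l.events, Event{ClaimID: claimID, Stage: s})
	return true
}

// Events returns a copy of the log so callers cannot mutate history.
func (l *Log) Events() []Event {
	l.mu.Lock()
	defer l.mu.Unlock()

	out := make([]Event, len(l.events))
	copy(out, l.events)
	return out
}

func main() {
	events := NewLog()

	// The second Submit simulates a redelivered message.
	for _, s := range []Stage{Ingest, Normalize, Submit, Submit, Reconcile} {
		fmt.Println(s, events.Append("claim-123", s))
	}
	fmt.Println(len(events.Events()))
}
```

In this sketch the duplicate submit returns false and the log ends with four events. In a real database, the same guarantee would more likely come from a unique constraint on the claim ID and stage than from an in-process map.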
Observability got first-class treatment. Every stage writes a structured span; every claim has a queryable lifecycle page in the internal admin. When a claim gets stuck, the on-call engineer can see exactly where and why in under 30 seconds.
04 · The outcome
We shipped in week six, migrated live traffic over a weekend using the feature flag, and stayed for another week of calibration. Then we handed it off (documented, tested, and instrumented) and open-sourced the queue primitive as go-pipeline.
Vellum's team now ships changes to the claim lifecycle in hours instead of days. The error rate has stayed below 0.05% for the six months since handoff, and the manual reconciliation queue is essentially empty.