Work / Vellum Health
Vellum Health · Healthtech · Stockholm

A billing pipeline that bills correctly, every time.

Rebuilding Vellum's claims processing from a fragile cron soup into a resilient, observable Go pipeline — and open-sourcing the hard parts.

Duration: 6 weeks
Team: 2 engineers
Shipped: 2025
Stack: Go · Postgres · NATS
Outcome

The numbers, after six weeks of work.

99.98%
Of claims processed correctly on the first pass, up from 91.4%.
−62%
Reduction in manual reconciliation work per billing cycle.
380ms
Median end-to-end latency, down from 4.2 seconds.

01 · The context

Vellum is a clinical admin platform for Nordic primary care groups. Every month, they process about 180,000 insurance claims on behalf of their customers. When we were introduced, about 9% of those claims were failing on first pass — not because of rejections from insurers, but because of race conditions and lost messages inside Vellum's own pipeline.

The existing system was a reasonable first version: a Rails monolith, a few background workers, and about a dozen cron jobs held together with good intentions and a Notion page called "Deploy Sequence". It had worked for a long time. It wasn't going to survive the next 3× in volume.

The real problem: Nobody on the team could confidently describe the ordering guarantees between steps. When a claim failed, reconciliation was manual. That's a people cost, and it was growing linearly with volume.

02 · The approach

We ran a one-week discovery on-site in Stockholm. The output was a 14-page memo with a clear recommendation: replace the pipeline with a small, typed Go service built around append-only claim events and a stateless worker pool. Keep Rails for everything else.
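
The append-only model can be sketched in a few lines (types and event names here are illustrative, not Vellum's actual schema): a claim's state is never mutated in place; each transition is a new event, and the current state is derived by folding the log.

```go
package main

import "fmt"

// ClaimEvent is one immutable entry in a claim's history.
type ClaimEvent struct {
	ClaimID string
	Kind    string // e.g. "ingested", "normalized", "submitted", "reconciled"
}

// CurrentStage folds a claim's event log into its latest stage.
// Because the log is append-only, replaying it is always safe.
func CurrentStage(events []ClaimEvent) string {
	stage := "unknown"
	for _, e := range events {
		stage = e.Kind
	}
	return stage
}

func main() {
	log := []ClaimEvent{
		{ClaimID: "c-001", Kind: "ingested"},
		{ClaimID: "c-001", Kind: "normalized"},
		{ClaimID: "c-001", Kind: "submitted"},
	}
	fmt.Println(CurrentStage(log)) // submitted
}
```

The payoff of this shape is debuggability: a stuck claim's full history is a single ordered query, not a reconstruction from scattered mutable rows.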

Everyone agreed that rewrites are usually a mistake. This wasn't a rewrite — it was a carve-out. We'd own the claim lifecycle, expose an HTTP API for the monolith to call, and leave every other system alone. The new service went into production behind a feature flag on day 15.

"The memo alone was worth the engagement. We ended up using it to onboard two new engineers before the code was even merged."

03 · The architecture

The pipeline is four stages: ingest, normalize, submit, reconcile. Each stage reads from and writes to Postgres, emits events via NATS, and is idempotent on a claim ID. We used go-pipeline (extracted from this project and now OSS) for fan-out and backpressure.

// Each worker is a small, typed handler.
func (w *ReconcileWorker) Handle(ctx context.Context, c Claim) error {
  result, err := w.insurer.Settle(ctx, c)
  if errors.Is(err, ErrTransient) {
    return pipeline.Retry(err) // requeue with backoff
  }
  if err != nil {
    return err // permanent failure: surface it, don't retry
  }
  return w.db.Settle(ctx, c.ID, result)
}
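
"Idempotent on a claim ID" means a redelivered message is acknowledged without doing the work twice. In production that guarantee lives in a Postgres constraint; the in-memory sketch below (names are illustrative, not the go-pipeline API) just shows the contract a stage has to satisfy.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentStage wraps a stage handler so duplicate deliveries of
// the same claim ID are acked without reprocessing.
type IdempotentStage struct {
	mu   sync.Mutex
	done map[string]bool
	fn   func(claimID string) error
}

func NewIdempotentStage(fn func(string) error) *IdempotentStage {
	return &IdempotentStage{done: map[string]bool{}, fn: fn}
}

func (s *IdempotentStage) Handle(claimID string) error {
	s.mu.Lock()
	already := s.done[claimID]
	s.mu.Unlock()
	if already {
		return nil // duplicate delivery: ack and move on
	}
	if err := s.fn(claimID); err != nil {
		return err // not marked done, so a retry will reprocess
	}
	s.mu.Lock()
	s.done[claimID] = true
	s.mu.Unlock()
	return nil
}

func main() {
	calls := 0
	stage := NewIdempotentStage(func(id string) error {
		calls++
		return nil
	})
	stage.Handle("c-001")
	stage.Handle("c-001") // redelivered (NATS is at-least-once)
	fmt.Println(calls) // 1
}
```

Because NATS delivers at least once, every stage has to tolerate duplicates; pushing the check to a unique constraint makes it survive restarts and concurrent workers.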

Observability got first-class treatment. Every stage writes a structured span; every claim has a queryable lifecycle page in the internal admin. When a claim gets stuck, the on-call engineer can see exactly where and why in under 30 seconds.

04 · The outcome

We shipped in week six, migrated live traffic over a weekend using the feature flag, and stayed for another week of calibration. Then we handed it off — documented, tested, and instrumented — and open-sourced the queue primitive as go-pipeline.

Vellum's team now ships changes to the claim lifecycle in hours instead of days. The error rate has stayed below 0.05% for the six months since handoff, and the manual reconciliation queue is essentially empty.

Next case study

A CLI engineers actually want to use.

Read Cartograph