
Building resilient CLIs in Go

Why every serious CLI tool needs a retry budget, a structured error surface, and a story for SIGTERM — and how we codified that in resilient-cli.

If you've been writing Go long enough, you've probably shipped a CLI that looked production-ready and turned out not to be. It had nice flags. It had colourful output. It exited zero on success. And then a customer ran it on a laptop with a flaky VPN, and the whole thing hung for forty minutes until they killed it and filed a bug.

At Taqnihub we've shipped about a dozen CLIs for clients over the last three years. Each time, we ran into the same three problems. So we eventually extracted the fix into an open-source package — resilient-cli — and wrote this note to explain why the package looks the way it does.

TL;DR: A resilient CLI needs three things most people skip: a retry budget, a structured error surface, and a clean story for SIGTERM. Leave any one out and you will get the bug report.

1. The retry budget

The most common CLI bug we see is retries without a budget. Someone wraps the network call in a loop, adds exponential backoff, and calls it a day. That's fine for transient failures. It's catastrophic when the backend is down for an hour.

A retry budget caps two things: the total wall-clock time spent retrying, and the total number of attempts. When either is exhausted, the CLI fails fast with a specific, actionable error. Here's the shape:

// Budget caps retry duration AND attempt count.
budget := rcli.NewBudget(
  rcli.WithMaxElapsed(30*time.Second),
  rcli.WithMaxAttempts(5),
)

err := budget.Do(ctx, func() error {
  return api.PostInvoice(ctx, invoice)
})
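
For readers who want to see the moving parts, here is a minimal sketch of how a budget like this might work internally. It is not the resilient-cli source: the Budget struct, its BaseDelay field, ErrBudgetExhausted, and the attemptsAgainstDeadBackend helper are all hypothetical names chosen for illustration, and the doubling backoff is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Budget is a hypothetical stand-in, not the resilient-cli type.
type Budget struct {
	MaxElapsed  time.Duration // cap on total wall-clock time spent retrying
	MaxAttempts int           // cap on the number of attempts
	BaseDelay   time.Duration // first backoff delay; doubled after each failure (assumption)
}

// ErrBudgetExhausted lets callers detect a spent budget with errors.Is.
var ErrBudgetExhausted = errors.New("retry budget exhausted")

// Do calls op until it succeeds, either cap is reached, or ctx is cancelled.
func (b Budget) Do(ctx context.Context, op func() error) error {
	start := time.Now()
	delay := b.BaseDelay
	for attempt := 1; ; attempt++ {
		if err := ctx.Err(); err != nil {
			return err
		}
		err := op()
		if err == nil {
			return nil
		}
		elapsed := time.Since(start)
		// Fail fast if this was the last allowed attempt, or if the next
		// sleep would push us past the time cap.
		if attempt >= b.MaxAttempts || elapsed+delay > b.MaxElapsed {
			return fmt.Errorf("%w: %d attempt(s) in %s, last error: %v",
				ErrBudgetExhausted, attempt, elapsed.Round(time.Millisecond), err)
		}
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
		delay *= 2
	}
}

// attemptsAgainstDeadBackend runs b against an operation that always fails
// and reports how many attempts were made before the budget gave up.
func attemptsAgainstDeadBackend(b Budget) (int, error) {
	attempts := 0
	err := b.Do(context.Background(), func() error {
		attempts++
		return errors.New("connection refused")
	})
	return attempts, err
}

func main() {
	b := Budget{
		MaxElapsed:  200 * time.Millisecond,
		MaxAttempts: 5,
		BaseDelay:   10 * time.Millisecond,
	}
	n, err := attemptsAgainstDeadBackend(b)
	fmt.Printf("gave up after %d attempt(s): %v\n", n, err)
	fmt.Println("budget exhausted:", errors.Is(err, ErrBudgetExhausted))
}
```

The design choice worth noticing is that the time cap is checked before sleeping, not after. That way the CLI never sleeps through a backoff it already knows it cannot afford.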

The failure mode matters. When the budget blows, the error you return should tell the user which attempt failed, how long you spent on each one, and what they should do about it. Which brings us to the second thing.

2. A structured error surface

Most CLIs print one of two things on failure: a Go error wrapped to death (failed to do thing: failed to do other thing: failed to: EOF), or a beautifully formatted banner that throws away all of the context a developer actually needs to file a bug.

The right answer is both. Give the user a human message. Give the machine — and any future grep — a structured payload next to it.

// One error type. Two audiences.
type Failure struct {
  Message    string    // for humans, past tense
  Cause      error     // full wrapped chain
  Code       string    // stable, snake_case
  Retryable  bool      // did we give up, or crash?
  Context    map[string]any // request id, region, etc.
}

A CLI that can't be grepped after the fact is a CLI that will be re-run until somebody grabs a screenshot.

We print the Message to stderr in a colour-safe block, write the full Failure as JSON to $XDG_STATE_HOME/rcli/last-error.json, and exit with a code derived from Code. The first two lines of our bug report template now read: "paste the contents of last-error.json".
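
As a rough sketch of how those three steps could hang together, the code below gives Failure the methods it needs to behave as an error, flattens Cause to a string for JSON, and maps Code to an exit code. Everything beyond the struct fields is an assumption: the JSON field names, the exitCodes table and its two example codes, and the WriteLastError helper are hypothetical, and the helper takes the state directory as a parameter instead of resolving $XDG_STATE_HOME itself.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// Failure mirrors the struct above; the methods below are illustrative.
type Failure struct {
	Message   string
	Cause     error
	Code      string
	Retryable bool
	Context   map[string]any
}

func (f *Failure) Error() string { return f.Message }
func (f *Failure) Unwrap() error { return f.Cause }

// MarshalJSON flattens Cause to its string form, because most error values
// have no exported fields and would otherwise serialise as "{}".
func (f *Failure) MarshalJSON() ([]byte, error) {
	cause := ""
	if f.Cause != nil {
		cause = f.Cause.Error()
	}
	return json.Marshal(struct {
		Message   string         `json:"message"`
		Cause     string         `json:"cause,omitempty"`
		Code      string         `json:"code"`
		Retryable bool           `json:"retryable"`
		Context   map[string]any `json:"context,omitempty"`
	}{f.Message, cause, f.Code, f.Retryable, f.Context})
}

// exitCodes is a hypothetical mapping; the numbers borrow from sysexits.h.
var exitCodes = map[string]int{
	"invalid_input":    64, // EX_USAGE
	"budget_exhausted": 75, // EX_TEMPFAIL
}

// ExitCode derives a process exit code from the first Failure in the chain.
func ExitCode(err error) int {
	if err == nil {
		return 0
	}
	var f *Failure
	if errors.As(err, &f) {
		if code, ok := exitCodes[f.Code]; ok {
			return code
		}
	}
	return 1
}

// WriteLastError writes f as indented JSON to dir/last-error.json.
func WriteLastError(dir string, f *Failure) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	data, err := json.MarshalIndent(f, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "last-error.json"), data, 0o600)
}

func main() {
	// Example values only; none of these come from a real run.
	err := fmt.Errorf("posting invoice: %w", &Failure{
		Message:   "Could not reach the billing API after 5 attempts.",
		Cause:     errors.New("dial tcp: i/o timeout"),
		Code:      "budget_exhausted",
		Retryable: true,
		Context:   map[string]any{"region": "eu-west-1"},
	})

	var f *Failure
	if errors.As(err, &f) {
		dir, derr := os.MkdirTemp("", "rcli-state")
		if derr == nil {
			defer os.RemoveAll(dir)
			if werr := WriteLastError(dir, f); werr != nil {
				fmt.Fprintln(os.Stderr, "could not record failure:", werr)
			}
		}
		fmt.Fprintln(os.Stderr, f.Message)
	}
	fmt.Println("exit code:", ExitCode(err))
}
```

Because Failure implements Unwrap, errors.As finds it even when callers wrap it further up the stack, so the exit code survives any amount of "failed to do thing" prefixing.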

3. A story for SIGTERM

If your CLI runs in CI, it will one day be killed by SIGTERM — because the runner timed out, because a dependent job failed, because somebody hit cancel. The vast majority of CLIs handle this by doing nothing, which means:

  • half-uploaded files stay half-uploaded on the server
  • the local cache ends up in a state nothing can recover from
  • progress bars claim 100% on the next run, because the marker was written before the work finished

The fix is a context that actually propagates. The top of every main() in a Taqnihub CLI looks like this:

ctx, cancel := signal.NotifyContext(
  context.Background(),
  os.Interrupt, syscall.SIGTERM,
)
defer cancel()

if err := run(ctx); err != nil {
  rcli.PrintFailure(os.Stderr, err)
  os.Exit(rcli.ExitCode(err))
}

Every long-running operation accepts ctx, checks it at natural boundaries, and — this is the part most code skips — cleans up on the way out. Half-written files get renamed to .partial. In-flight HTTP requests get a last-ditch best-effort cancel. The cache index gets flushed.
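
To make the clean-up step concrete, here is a small sketch of a context-aware file write that renames its output to .partial when it is interrupted. It illustrates the idea only and is not code from resilient-cli: ctxReader, writeFile, and demoWrite are hypothetical names, and the demo cancels the context by hand where a real CLI would receive SIGTERM through signal.NotifyContext.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// ctxReader refuses to read once ctx is done, so io.Copy stops at the next
// chunk boundary instead of running to completion.
type ctxReader struct {
	ctx context.Context
	r   io.Reader
}

func (c ctxReader) Read(p []byte) (int, error) {
	if err := c.ctx.Err(); err != nil {
		return 0, err
	}
	return c.r.Read(p)
}

// writeFile streams src into path. If the write fails for any reason,
// including cancellation, whatever was written is renamed to path+".partial"
// so nothing mistakes it for a finished file.
func writeFile(ctx context.Context, path string, src io.Reader) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer func() {
		if cerr := f.Close(); err == nil {
			err = cerr
		}
		if err != nil {
			// Best effort: the original error is what the caller needs.
			_ = os.Rename(path, path+".partial")
		}
	}()
	_, err = io.Copy(f, ctxReader{ctx, src})
	return err
}

// demoWrite runs writeFile in a temp dir, optionally with an already
// cancelled context, and reports which of the two files exists afterwards.
func demoWrite(cancelFirst bool) (finalExists, partialExists bool, err error) {
	dir, err := os.MkdirTemp("", "rcli-sketch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel() // stands in for SIGTERM arriving mid-run
	}

	path := filepath.Join(dir, "upload.bin")
	_ = writeFile(ctx, path, strings.NewReader("payload"))

	_, statErr := os.Stat(path)
	finalExists = statErr == nil
	_, statErr = os.Stat(path + ".partial")
	partialExists = statErr == nil
	return finalExists, partialExists, nil
}

func main() {
	for _, cancelled := range []bool{false, true} {
		final, partial, err := demoWrite(cancelled)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("cancelled=%v final=%v partial=%v\n", cancelled, final, partial)
	}
}
```

Putting the rename in a deferred function means every exit path, not only cancellation, leaves the file under an honest name.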

Putting it together

None of this is novel. It's just the stuff that people skip when they're shipping a CLI at 5pm on a Friday. resilient-cli bundles sensible defaults for all three so you can stop rewriting them.

If you want to read the code, it's on our open source page. If you want us to review your CLI — or build one for you — here's how.
