Spot the Tax · Card 13 of 20

Background jobs must be safe to run twice

Why a Sidekiq retry will charge your customer a second time if the job isn't idempotent.

The code

What will this cost you in six months?

class ChargeUserJob < ApplicationJob
  def perform(user_id, amount_cents)
    user = User.find(user_id)
    Stripe::Charge.create(
      amount: amount_cents,
      currency: "usd",
      customer: user.stripe_customer_id
    )
  end
end

The problem

Sidekiq automatically retries jobs that fail. The failure could be a network blip mid-API-call, a worker getting a SIGTERM during a deploy, or a timeout on the response from Stripe. In all of those cases the API call may have already succeeded; you just didn't get the response back. Sidekiq then runs the job again, and Stripe charges the customer a second time. The job worked. It just worked twice. The customer calls support.
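The failure mode can be reproduced without Stripe or Sidekiq. The sketch below is a plain-Ruby simulation, not code from either library: `FakeGateway`, `ResponseLost`, and `run_with_retries` are hypothetical stand-ins. The fake gateway records the charge and then drops the response once, roughly the way a read timeout would.

```ruby
# Hypothetical stand-ins for illustration; none of these names come from Stripe or Sidekiq.
class ResponseLost < StandardError; end

# A fake payment gateway that records the charge, then "loses" the response
# on the first call, the way a read timeout would.
class FakeGateway
  attr_reader :charges

  def initialize
    @charges = []
    @calls = 0
  end

  def charge(customer:, amount_cents:)
    @calls += 1
    # The money moves here, before the caller hears anything back.
    @charges << { customer: customer, amount_cents: amount_cents }
    raise ResponseLost, "timed out waiting for response" if @calls == 1
    :ok
  end
end

# A minimal retry loop standing in for the job processor's retry behaviour.
def run_with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue ResponseLost
    retry if attempts < max_attempts
    raise
  end
end

gateway = FakeGateway.new
run_with_retries { gateway.charge(customer: "cus_123", amount_cents: 5_000) }

puts gateway.charges.size # prints 2: two charges for one logical payment
```

From the caller's side the first attempt looks like a failure, so retrying is the reasonable thing to do. The gateway's view is different: it completed two separate charges.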

Take a moment. Before revealing, think about what would have to be true for the job to be safe to run any number of times. What state do you need to track, and where?

The solution

Make the work safe to run any number of times. Pass an idempotency key to Stripe so the second call returns the original result without doing anything new, and store local state so the job short-circuits before the API call on retry.

- Sidekiq can retry the job freely without anyone getting double-charged.
- Stripe deduplicates by the idempotency key on its side.
- You can re-run the job manually to recover from failures, without fear.

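class ChargeUserJob < ApplicationJob
  def perform(charge_request_id)
    request = ChargeRequest.find(charge_request_id)
    # Local short-circuit: a retry after a fully recorded charge does nothing.
    return if request.completed?

    # stripe-ruby takes request params in the first hash and per-request
    # options, including the idempotency key, in the second.
    result = Stripe::Charge.create(
      {
        amount: request.amount_cents,
        currency: "usd",
        customer: request.user.stripe_customer_id
      },
      { idempotency_key: "charge-request-#{request.id}" }
    )

    # Record the outcome so later runs skip the API call entirely.
    request.update!(completed: true, stripe_charge_id: result.id)
  end
end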

The principle at play — Idempotency

Anything that runs in a system you don't fully control — background workers, message queues, retried HTTP calls — can run more than once. Network failures don't tell you whether the work succeeded; they just tell you that the response didn't make it back. So any work that can be retried has to be safe to retry, which means doing it once and doing it ten times have to produce the same result.

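For local state, that's usually a check-then-act pattern: look at a record before doing the work, skip if it's already done, mark it done after. For external APIs, most modern providers expose an idempotency key: you generate a unique ID for the operation, send it with the call, and the API returns the original result for later calls with the same key instead of doing the work again. The key has to be derived from the operation itself (here, the charge request's ID), not generated fresh on each attempt, or every retry looks like a new operation. Providers only remember keys for a limited window (Stripe's documentation says keys may be pruned once they are at least 24 hours old), so the local record still matters for retries that arrive later than that.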
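Both halves of that paragraph fit in a few lines of plain Ruby. This is a sketch under stated assumptions: `IdempotentGateway`, `ChargeRecord`, and `process` are hypothetical names, and the in-memory hash only approximates how a provider might deduplicate by key. It is not a description of Stripe's implementation.

```ruby
# Hypothetical in-memory sketch; IdempotentGateway and ChargeRecord are not real library classes.

# Provider side: the first call with a given key does the work and stores the
# result; later calls with the same key return the stored result.
class IdempotentGateway
  attr_reader :charges_made

  def initialize
    @results_by_key = {}
    @charges_made = 0
  end

  def charge(amount_cents:, idempotency_key:)
    @results_by_key[idempotency_key] ||= begin
      @charges_made += 1
      { id: "ch_#{@charges_made}", amount_cents: amount_cents }
    end
  end
end

# Local side: a record that remembers whether the work is done.
ChargeRecord = Struct.new(:id, :amount_cents, :completed, :charge_id)

def process(record, gateway)
  # Check: skip if the work is already recorded as done.
  return if record.completed

  # Act: the key is derived from the record, so it is stable across retries.
  result = gateway.charge(
    amount_cents: record.amount_cents,
    idempotency_key: "charge-request-#{record.id}"
  )

  # Mark done, keeping the provider's ID for later reference.
  record.completed = true
  record.charge_id = result[:id]
end

gateway = IdempotentGateway.new
record = ChargeRecord.new(42, 5_000, false, nil)

3.times { process(record, gateway) }

# Even with the local check bypassed, the same key returns the original result.
again = gateway.charge(amount_cents: 5_000, idempotency_key: "charge-request-42")

puts gateway.charges_made           # prints 1
puts again[:id] == record.charge_id # prints true
```

The two mechanisms overlap on purpose. The local check avoids the network call in the common case, and the key covers the cases where local state was never written.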

The hard part isn't the technique; it's noticing that retries are something to design for in the first place. The job that "works" the first time is exactly the one that bites you the day a worker dies between the API call and writing the result back.
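That gap between the API call and the write-back is worth simulating, because it is the case where the local check alone does not help. The following is a hypothetical plain-Ruby sketch: `KeyedGateway`, `WorkerDied`, and the `die_before_write` flag are invented for illustration and stand in for a provider that deduplicates by key and a worker killed mid-job.

```ruby
# Hypothetical simulation; KeyedGateway, WorkerDied, and perform are illustrative names only.
class WorkerDied < StandardError; end

class KeyedGateway
  attr_reader :charges_made

  def initialize
    @seen = {}
    @charges_made = 0
  end

  # Does the work once per key; repeat calls return the stored result.
  def charge(key:)
    @seen[key] ||= begin
      @charges_made += 1
      "ch_#{@charges_made}"
    end
  end
end

# The job body: check local state, call the gateway, write the result back.
# die_before_write simulates the worker being killed in the gap.
def perform(state, gateway, die_before_write: false)
  return if state[:completed]

  charge_id = gateway.charge(key: "charge-request-#{state[:id]}")
  raise WorkerDied if die_before_write

  state[:completed] = true
  state[:charge_id] = charge_id
end

gateway = KeyedGateway.new
state = { id: 7, completed: false, charge_id: nil }

begin
  # First attempt: the charge happens, but the result is never recorded.
  perform(state, gateway, die_before_write: true)
rescue WorkerDied
  # A job processor would schedule a retry at this point.
end

# Retry: the local check sees nothing recorded, so the key has to dedupe.
perform(state, gateway)

puts gateway.charges_made # prints 1
puts state[:completed]    # prints true
```

On the retry, `state[:completed]` is still false, so the job calls the gateway again. Only the stable key prevents a second charge, which is why the local record and the idempotency key are used together and not as alternatives.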
