Spot the Tax · Card 13 of 20
Background jobs must be safe to run twice
Why a Sidekiq retry will charge your customer a second time if the job isn't idempotent.
The code
What will this cost you in six months?
class ChargeUserJob < ApplicationJob
  def perform(user_id, amount_cents)
    user = User.find(user_id)
    Stripe::Charge.create(
      amount: amount_cents,
      currency: "usd",
      customer: user.stripe_customer_id
    )
  end
end
The problem
Sidekiq automatically retries jobs that fail. The failure could be a network blip mid-API-call, a worker receiving a SIGTERM during a deploy, or a timeout while waiting for Stripe's response. In each of those cases the API call may already have succeeded; you just never got the response back. Sidekiq then runs the job again and Stripe charges the customer a second time. The job worked. It just worked twice. The customer calls support.
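The failure mode can be reproduced without Sidekiq or Stripe. The sketch below is a simulation under stated assumptions: `FlakyGateway` and `run_with_retries` are hypothetical stand-ins, not real Stripe or Sidekiq APIs. The gateway records the charge and then loses the response on the first call, which is roughly what a timeout looks like from the job's side.

```ruby
require "timeout"

# Hypothetical stand-in for a payment API; not the real Stripe client.
class FlakyGateway
  attr_reader :charges

  def initialize
    @charges = []
    @calls = 0
  end

  # Records the charge, then "loses" the response on the first call only.
  def charge(customer:, amount_cents:)
    @calls += 1
    @charges << { customer: customer, amount_cents: amount_cents }
    raise Timeout::Error, "response lost" if @calls == 1
    :ok
  end
end

# Minimal retry loop, standing in for what a job runner does on failure.
def run_with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Timeout::Error
    retry if attempts < max_attempts
    raise
  end
end

gateway = FlakyGateway.new
run_with_retries { gateway.charge(customer: "cus_123", amount_cents: 5_000) }

puts gateway.charges.size # prints 2: one job, two charges
```

From the job's point of view the second attempt succeeded and nothing looks wrong. Only the gateway's ledger shows the duplicate.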
Take a moment. Before revealing, think about what would have to be true for the job to be safe to run any number of times. What state do you need to track, and where?
The solution
Make the work safe to run any number of times. Pass an idempotency key to Stripe so the second call returns the original result without doing anything new, and store local state so the job short-circuits before the API call on retry.
- Sidekiq can retry the job freely without anyone getting double-charged
- Stripe deduplicates by the idempotency key on its side
- You can re-run the job manually to recover from failures, without fear
class ChargeUserJob < ApplicationJob
  def perform(charge_request_id)
    request = ChargeRequest.find(charge_request_id)
    # Local short-circuit: skip the API call if a previous run already finished.
    return if request.completed?

    # stripe-ruby takes per-request options such as idempotency_key in a
    # second hash, separate from the charge parameters.
    result = Stripe::Charge.create(
      {
        amount: request.amount_cents,
        currency: "usd",
        customer: request.user.stripe_customer_id
      },
      { idempotency_key: "charge-request-#{request.id}" }
    )
    request.update!(completed: true, stripe_charge_id: result.id)
  end
end
The principle at play: Idempotency
Anything that runs in a system you don't fully control — background workers, message queues, retried HTTP calls — can run more than once. Network failures don't tell you whether the work succeeded; they just tell you that the response didn't make it back. So any work that can be retried has to be safe to retry, which means doing it once and doing it ten times have to produce the same result.
For local state, that's usually a check-then-act pattern: look at a record before doing the work, skip if it's already done, mark it done after. For external APIs, most modern providers expose an idempotency key — you generate a unique ID for the operation, send it with the call, and the API guarantees that subsequent calls with the same key return the original result instead of doing the work again.
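Both halves of that paragraph can be sketched in plain Ruby. This is an illustration, not how any particular provider implements deduplication: `KeyedGateway`, `ChargeRecord`, and `process` are hypothetical names, and the in-memory hash stands in for whatever storage a real provider uses for its keys.

```ruby
# Hypothetical in-memory stand-in for a provider that honors idempotency keys.
class KeyedGateway
  attr_reader :charges

  def initialize
    @charges = []
    @results_by_key = {}
  end

  # Performs the charge once per key; later calls return the stored result.
  def charge(amount_cents:, idempotency_key:)
    @results_by_key.fetch(idempotency_key) do
      charge_id = "ch_#{@charges.size + 1}"
      @charges << { id: charge_id, amount_cents: amount_cents }
      @results_by_key[idempotency_key] = charge_id
    end
  end
end

# Hypothetical local record, playing the role of a database row.
ChargeRecord = Struct.new(:id, :amount_cents, :completed, :charge_id)

def process(record, gateway)
  # Check: skip work that has already been marked done.
  return record.charge_id if record.completed

  # Act: the key is derived from the record, so every retry sends the same one.
  charge_id = gateway.charge(
    amount_cents: record.amount_cents,
    idempotency_key: "charge-request-#{record.id}"
  )

  # Mark done so later runs short-circuit before calling the gateway.
  record.completed = true
  record.charge_id = charge_id
  charge_id
end

gateway = KeyedGateway.new
record = ChargeRecord.new(42, 5_000, false, nil)
ids = Array.new(10) { process(record, gateway) }

puts gateway.charges.size # prints 1
puts ids.uniq.size        # prints 1: every run saw the same charge ID
```

The local check on its own is probably not enough when two runs overlap, since both can pass the check before either marks the record done. In this sketch the key sent to the gateway is what covers that gap.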
The hard part isn't the technique, it's noticing that retries are something to design for in the first place. The job that "works" the first time is exactly the one that bites you the day a worker dies between the API call and writing the result back.
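The "worker dies between the API call and writing the result back" case is worth simulating, because it is the one the local check cannot catch by itself. The sketch below uses hypothetical names throughout (`KeyedGateway`, `WorkerDied`, `run_job`, and the `crash_after_charge` flag are all invented for illustration) and assumes a provider that deduplicates by idempotency key.

```ruby
# Hypothetical provider stand-in that deduplicates by idempotency key.
class KeyedGateway
  attr_reader :charge_count

  def initialize
    @charge_count = 0
    @seen = {}
  end

  # Charges once per key and returns the same charge ID on repeat calls.
  def charge(idempotency_key:)
    @seen[idempotency_key] ||= begin
      @charge_count += 1
      "ch_#{@charge_count}"
    end
  end
end

class WorkerDied < StandardError; end

# crash_after_charge simulates the worker dying before the write-back.
def run_job(state, gateway, crash_after_charge: false)
  return if state[:completed]

  charge_id = gateway.charge(idempotency_key: "charge-request-#{state[:id]}")
  raise WorkerDied if crash_after_charge

  state[:completed] = true
  state[:charge_id] = charge_id
end

gateway = KeyedGateway.new
state = { id: 7, completed: false, charge_id: nil }

begin
  run_job(state, gateway, crash_after_charge: true)
rescue WorkerDied
  # The charge went through, but local state still says "not completed".
end

run_job(state, gateway) # the retry: same key, so no second charge

puts gateway.charge_count # prints 1
puts state[:charge_id]    # prints ch_1
```

On the retry the local check passes, because nothing was written back, so the job calls the gateway again. The shared key is what turns that second call into a lookup instead of a second charge.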