โ† Back to Course

Scaling Series · 6 of 8

Background Jobs as a Scaling Tool

Where Rails background jobs come from, why the request cycle is for translation rather than work, and how Sidekiq and Solid Queue let teams scale throughput by moving work off the request path.

Where this rule comes from

The HTTP request cycle was designed for translation: a browser sends a request, the server responds, the connection closes. Anything that happens between request and response holds resources open: a Puma thread, a database connection, sometimes a load-balancer slot. The longer the response takes, the more concurrent capacity you need to handle the same throughput.

Rails developers learned this early. The first wave of Rails background-job libraries arrived between roughly 2006 and 2009: BackgrounDRb, then delayed_job, then Resque. The core insight was the same across all of them: move work that does not need to happen synchronously into a separate process that picks up jobs from a queue. The request cycle stays short; the work still happens; the user does not wait.

Sidekiq, released by Mike Perham in 2012, became the dominant choice for a decade. It used threads inside one process, backed by Redis, and could process tens of thousands of jobs per second on modest hardware. Rails 4.2 added ActiveJob in 2014, an adapter layer that let application code call perform_later without committing to a specific backend.
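What that adapter layer looks like in practice, as a minimal sketch (MyApp is a placeholder application name):

# config/application.rb
module MyApp
  class Application < Rails::Application
    # Application code calls perform_later the same way against any backend;
    # only this line changes when you swap adapters.
    config.active_job.queue_adapter = :sidekiq   # or :solid_queue, :resque, :test
  end
end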

Solid Queue arrived with Rails 8 in 2024, written at 37signals for Hey. It is database-backed (Postgres, MySQL, or SQLite), removes the Redis dependency, and ships as the default queue adapter for new Rails 8 apps. The two-schools framing applies here: Solid Queue and Sidekiq are both senior choices, with different tradeoffs spelled out below.

The senior rule across all of these: the request cycle is for translation, turning HTTP requests into model calls and back. The work itself, if it takes more than a few milliseconds, belongs in a background job.

The anti-pattern

Picture a Rails app where the team has been adding integrations one at a time. Each integration's network call landed in the controller because that was the simplest place at the time:

class OrdersController < ApplicationController
  def create
    @order = current_user.orders.create!(order_params)

    # External calls, all synchronous, all in the request path:
    Stripe::Charge.create(amount: @order.total_cents, customer: ...)  # 800ms
    MailchimpClient.add_to_purchasers_list(current_user.email)         # 400ms
    SlackNotifier.post("New order ##{@order.id}")                      # 200ms
    SegmentClient.track("order_completed", user_id: current_user.id)   # 150ms
    OrderMailer.confirmation(@order).deliver_now                       # 600ms

    redirect_to @order
  end
end

The user clicks "Place Order." Their browser waits for 2.15 seconds. A Puma thread is held for the entire duration. A database connection sits idle inside the open transaction (if there is one) while five external HTTP calls happen serially.

The compounding problems:

  • If any one of the five integrations fails, the whole order fails. If Mailchimp is down, the user sees a 500 even though the order row was already saved. The team's first fix is usually rescue StandardError blocks around each call, which silently swallow real bugs.
  • Slow integrations slow every order. If Stripe responds in 800ms today and 4 seconds during a Stripe incident, every order takes 4+ seconds, and the team's web dynos run out of threads.
  • Throughput is bottlenecked on the slowest external service. The team's web capacity is not "how many requests can Puma serve"; it is "how long does the slowest external call take, times the request rate."
  • Retries are not possible. If Slack's API returns a 429 rate-limit response, there is no way to retry the Slack call without replaying the entire request.

Each of these is a real production incident shape, and all of them have the same fix: move the external calls into background jobs, and respond to the user as soon as the order is saved.

The fix

Each external call becomes its own job. The controller does the minimum work synchronously (save the order, charge the card if absolutely necessary), then queues the rest:

class OrdersController < ApplicationController
  def create
    @order = current_user.orders.create!(order_params)

    ChargeOrderJob.perform_later(@order.id)
    AddToMailchimpJob.perform_later(current_user.id)
    NotifySlackJob.perform_later(@order.id)
    TrackOrderCompletedJob.perform_later(current_user.id, @order.id)
    SendOrderConfirmationJob.perform_later(@order.id)

    redirect_to @order
  end
end

class ChargeOrderJob < ApplicationJob
  retry_on Stripe::APIConnectionError, wait: :polynomially_longer, attempts: 5

  def perform(order_id)
    order = Order.find(order_id)
    return if order.charged?

    Stripe::Charge.create(amount: order.total_cents, customer: order.user.stripe_customer_id)
    order.update!(charged: true, charged_at: Time.current)
  end
end

The controller now returns in ~80ms instead of 2,150ms. The Puma thread is freed in 1/27th the time, which means the same web dyno can serve 27x more orders per second. The work still happens, asynchronously, in worker processes that scale independently of the web tier.
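What "scale independently" looks like operationally, in a Heroku-style Procfile (a sketch; the process names and config paths are illustrative):

web: bundle exec puma -C config/puma.rb
worker: bundle exec sidekiq -C config/sidekiq.yml

Adding web capacity and adding worker capacity become separate dials: a backlog of jobs means scaling the worker line, not the web line.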

Two structural patterns to notice:

1. Pass record IDs, not records. ChargeOrderJob.perform_later(@order.id), not ChargeOrderJob.perform_later(@order). The job will run later, possibly minutes later, and the record may have changed in the meantime. Re-fetching from the database inside the job means the job always works on the current state of the record, not a frozen snapshot from the request.

2. Make jobs idempotent. return if order.charged? means running the job twice never charges the card twice. This is the single most important property of a well-designed background job: it can be re-run safely, because retries, manual re-queues, and rare delivery duplications are all normal. "Crash the dyno mid-job, re-run it on restart" should result in the same outcome as "run it once cleanly."
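One refinement on the idempotency point, as a sketch rather than part of the canonical example above: the charged? guard leaves a small window between the charge succeeding and the flag being saved. Payment providers close that window with an idempotency key, which Stripe's Ruby client accepts as a per-request option:

class ChargeOrderJob < ApplicationJob
  def perform(order_id)
    order = Order.find(order_id)
    return if order.charged?

    # Stripe deduplicates on its side, so a retry that slips past the
    # charged? guard still produces exactly one charge.
    Stripe::Charge.create(
      { amount: order.total_cents, customer: order.user.stripe_customer_id },
      idempotency_key: "charge-order-#{order.id}"
    )
    order.update!(charged: true, charged_at: Time.current)
  end
end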

Solid Queue vs Sidekiq

Two main choices in modern Rails. Both are senior; the tradeoff is real and worth understanding:

  • Sidekiq (Redis-backed). High throughput, tens of thousands of jobs per second per process. Mature ecosystem (sidekiq-cron, sidekiq-batch, etc.). Excellent web UI for monitoring. Pro/Enterprise tiers add unique-job, batching, and prioritization features. Requires running Redis as a separate service.
  • Solid Queue (Postgres-, MySQL-, or SQLite-backed). Ships with Rails 8 as the default. No Redis to operate. Slightly lower throughput than Sidekiq (Redis is a faster store than Postgres for queue ops), still well above what most apps need. Transactional with your application database: you can enqueue a job inside the same transaction that creates the record (see the sketch after this list).
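That transactional property is worth seeing concretely. A minimal sketch, reusing the models from the example above and assuming Solid Queue writes to the same database as the app:

ApplicationRecord.transaction do
  order = user.orders.create!(order_params)
  ChargeOrderJob.perform_later(order.id)   # row written to solid_queue_jobs
end
# If create! raises, the rollback removes the enqueued job too: no worker
# can ever pick up a job pointing at an order that was never saved.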

The choice usually comes down to two questions: do you already run Redis (if so, either is fine; Sidekiq is the more mature option), and is your queue throughput in the millions-of-jobs-per-day range (if so, Sidekiq's raw per-second throughput starts to matter).

The 37signals school uses Solid Queue. Shopify uses Sidekiq. Both are senior. Pick the one that fits your team's operational baseline.

Retry strategy and dead jobs

Background jobs fail. The senior question is not "how do I prevent failure" but "what is the retry strategy for each kind of failure?"

  • Transient errors (a network blip, a third-party 503, a brief rate limit): retry with exponential backoff. ActiveJob's retry_on handles this declaratively.
  • Permanent errors (a validation failure, a missing record, a business-rule violation): discard the job. discard_on ActiveRecord::RecordNotFound is the canonical Rails idiom.
  • Unknown errors: let the job's default retry kick in. After N attempts (25 by default in Sidekiq, configurable in Solid Queue), the job goes to a "dead set" for inspection.

All three shapes are declarative at the top of the job class:

class ChargeOrderJob < ApplicationJob
  # Transient: retry with backoff
  retry_on Stripe::APIConnectionError, wait: :polynomially_longer, attempts: 5
  retry_on Stripe::RateLimitError,     wait: :polynomially_longer, attempts: 8

  # Permanent: log and discard
  discard_on Stripe::InvalidRequestError do |job, error|
    Rails.logger.error("Cannot charge order #{job.arguments.first}: #{error.message}")
  end

  # Specific failure: do something custom
  discard_on Stripe::CardError do |job, error|
    order = Order.find(job.arguments.first)
    order.update!(status: "card_declined", failure_reason: error.message)
  end
end

The combination of retry_on, discard_on, and idempotent perform bodies handles 95% of real-world failure modes cleanly. The dead set absorbs the remaining 5% for human review.

Retry storms

The opposite failure mode is when retries themselves become the problem. A retry storm happens when a downstream service is briefly degraded and a queue full of pending jobs all retry simultaneously, hammering the downstream service further and prolonging the incident.

The fixes:

  • Exponential backoff with jitter. ActiveJob (Rails 7.1+) ships wait: :polynomially_longer for retries; it includes jitter so retries do not align. Never use a fixed retry interval; that is the textbook recipe for a storm.
  • Circuit breakers. For external services, libraries like stoplight or circuit_breaker let you stop calling the service after N consecutive failures, failing fast until it recovers instead of hammering it.
  • Separate queues with different concurrency. Critical work (charging cards) gets a high-priority queue with its own worker pool; less-critical work (analytics events) gets a lower-priority queue that can be paused under load. A sketch follows this list.
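A sketch of the queue-separation pattern (the queue names and weights are illustrative, not a convention from this article's app):

class ChargeOrderJob < ApplicationJob
  queue_as :critical   # dedicated workers; never starved by bulk work
end

class TrackOrderCompletedJob < ApplicationJob
  queue_as :low        # safe to pause or drain slowly during an incident
end

# config/sidekiq.yml -- weighted polling, so :critical is checked most often:
# :queues:
#   - [critical, 6]
#   - [default, 3]
#   - [low, 1]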

What real teams have written

Mike Perham, the author of Sidekiq, writes the canonical blog for Rails background-job patterns. His posts on idempotency, retry strategy, and concurrency tuning are how most senior Rails developers learned the discipline.

Shopify runs Sidekiq at one of the largest known Rails-job scales, hundreds of thousands of jobs per second across their fleet. Their engineering posts about partitioning Sidekiq across many Redis instances, and about implementing back-pressure between services, are advanced reading. The pattern worth absorbing from Shopify's writing is that they ran on plain Sidekiq for many years before any of these advanced shapes became necessary; most apps will never need the partitioning.

37signals' blog and conference talks documented the case for Solid Queue. Their pitch is the operational simplicity of "no Redis to operate," especially appealing to teams running smaller fleets where the marginal cost of another service is meaningful. The case is real; Rails 8 making Solid Queue the default is a strong signal that the operational tradeoff has shifted.

GitHub runs almost all of their async work through ActiveJob with multiple backends underneath. Their writing on job idempotency, job-level rate limiting, and the use of perform_later as the default for almost everything that is not a 5ms operation is instructive. The cultural rule at GitHub Engineering is roughly "if it takes more than 100ms, it is a job."

When NOT to use a background job

Two cases:

  • When the user needs the result immediately. "Charge this card and tell me if it succeeded" cannot run in a background job, because the user is waiting for a yes/no on the next page. Some apps split this: the charge happens synchronously, but everything else (emails, analytics, Slack notifications) goes to jobs, as sketched after this list.
  • When the work is genuinely fast enough. A 5ms call to update a row does not benefit from being a job; the queueing and worker overhead is larger than the work itself.
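A minimal sketch of that split (the payment_failed view and the error-handling shape are illustrative, not from the examples above):

class OrdersController < ApplicationController
  def create
    @order = current_user.orders.create!(order_params)

    # The user is waiting for a yes/no, so the charge stays in the request:
    begin
      Stripe::Charge.create(amount: @order.total_cents,
                            customer: current_user.stripe_customer_id)
      @order.update!(charged: true, charged_at: Time.current)
    rescue Stripe::CardError => e
      @order.update!(status: "card_declined", failure_reason: e.message)
      return render :payment_failed, status: :unprocessable_entity
    end

    # Everything the user does not need on the next page goes to jobs:
    SendOrderConfirmationJob.perform_later(@order.id)
    TrackOrderCompletedJob.perform_later(current_user.id, @order.id)

    redirect_to @order
  end
end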

The senior heuristic for what belongs in a job: anything that crosses the network (a mailer, an external API, a webhook), anything that takes more than ~100ms, and anything whose result the user does not need on the next page. The default for these should be "background job"; the exceptions should be deliberate.

The principle at play

The request cycle is a finite, expensive resource. Every millisecond a request takes is a millisecond of Puma thread occupancy, database connection occupancy, and load-balancer slot occupancy. Moving work out of the request cycle is the highest-impact throughput change in Rails, because it converts a synchronous resource (web threads) into an asynchronous one (worker processes) that can be scaled independently.

The deeper move is recognizing that "the user clicked the button" and "the work is done" are different events. Acknowledging the click takes 80ms; finishing the work takes however long it takes, and the user finds out later via email, a websocket update, or a page refresh. Decoupling those two events is what allows Rails apps to feel fast even when the underlying work is slow.

The pragmatic value: most Rails apps could move 50% of their controller work into jobs and see immediate throughput gains, with no infrastructure change beyond the queue adapter they already have. It is one of the most rewarding scaling moves in the series.

Practice exercise

  1. Grep your controllers for deliver_now. Each one is a candidate to become deliver_later, which queues the mail through ActiveJob and frees the controller.
  2. Grep for external API calls in controllers: Stripe::, Twilio::, HTTParty, Net::HTTP. For each one, ask: does the user need the result on the next page? If not, the call belongs in a job.
  3. Open your slowest controller endpoint in APM. Look at the breakdown: how much of the time is "external HTTP"? That time can almost certainly be moved off the request path.
  4. Pick one job in your codebase. Read its perform method. Ask: if this ran twice with the same arguments, would the result be the same? If not, the job is not idempotent, and you have a latent bug waiting for a retry to surface it.
  5. Bonus: if you are on Rails 8 and using Solid Queue, look at the queue tables in your database. SELECT queue_name, COUNT(*) FROM solid_queue_jobs WHERE finished_at IS NULL GROUP BY queue_name; tells you which queues are backed up (the finished_at filter excludes completed jobs that have not been cleaned up yet). If you are on Sidekiq, the web UI shows the same thing graphically.