Scaling Series · 3 of 8

The Concurrency Model

Where Puma's threading model comes from, why the GVL is real but rarely the bottleneck, and the connection pool math that decides how many concurrent users your Rails app can serve.

Where this rule comes from

Ruby has had threads since the 1990s, but its threading model is unusual. MRI (the standard Ruby implementation) has a Global VM Lock, the GVL, that ensures only one Ruby thread executes Ruby code at a time, even on a multi-core machine. The GVL exists to keep the C-extension ecosystem simple: extension authors do not have to make their code thread-safe, because Ruby's interpreter serializes Ruby execution behind the lock.

Most Ruby developers, including most senior ones, hear "GVL" and assume Ruby threads are useless for concurrency. This is wrong, and correcting it is the single biggest lever in Rails throughput tuning. The GVL only serializes the execution of Ruby code. MRI releases the lock during I/O: while Ruby waits on a database query, an HTTP call, a Redis operation, or a disk read, other threads run freely. For an I/O-bound application (which most Rails apps are), Ruby threads scale almost as well as separate processes, and they use much less memory.
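
You can see the GVL's real scope in plain Ruby, no Rails required. The sketch below is a minimal, self-contained benchmark: sleep stands in for blocking I/O (it releases the GVL the same way a database or HTTP wait does), while the math loop holds the lock.

# gvl_demo.rb — illustrative benchmark, plain Ruby
require "benchmark"

def cpu_work
  200_000.times { Math.sqrt(rand) }   # executes Ruby; holds the GVL
end

def io_work
  sleep 0.1                           # blocks; releases the GVL
end

%i[cpu_work io_work].each do |work|
  serial   = Benchmark.realtime { 8.times { send(work) } }
  threaded = Benchmark.realtime do
    Array.new(8) { Thread.new { send(work) } }.each(&:join)
  end
  puts format("%-8s serial: %.2fs  threaded: %.2fs", work, serial, threaded)
end

On a typical machine, the io_work time drops by roughly the thread count while cpu_work barely moves: the lock, not the thread API, is what serializes the CPU case.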

This insight is what made Puma the default Rails web server. Puma runs multiple processes (forks of the master), and within each process runs multiple threads. Processes scale CPU-bound work; threads scale I/O-bound work. A typical Rails request is mostly I/O (it waits for the database, then renders, then responds), so threads do the heavy lifting. Processes exist to use multiple CPU cores for the parts that are not waiting on I/O.

The senior move: understand which knob to turn before you turn it. Adding processes when threads were the right answer wastes memory. Adding threads when the database is the bottleneck floods the connection pool. Both are common mistakes by intermediate engineers, and both are visible in the right metrics.

The anti-pattern

Picture a Rails team running on Heroku. The app is slow under load. Sentry reports timeouts. Somebody opens config/puma.rb and finds this:

# config/puma.rb
workers ENV.fetch("WEB_CONCURRENCY", 2).to_i
threads_count = ENV.fetch("RAILS_MAX_THREADS", 5).to_i
threads threads_count, threads_count

The default. 2 processes × 5 threads = 10 concurrent requests served per dyno. Someone in Slack says "let's bump RAILS_MAX_THREADS to 25 to handle more concurrency." Twenty minutes later, deployed. Five minutes after that, the app is worse:

  • Requests are queueing in the database connection pool. ActiveRecord::ConnectionTimeoutError appears in Sentry.
  • Memory usage on each dyno climbs. R14 (memory quota exceeded) starts appearing.
  • Throughput goes down, not up. Some requests are now taking twice as long.

The team raised RAILS_MAX_THREADS without raising the database connection pool to match. The pool size is fixed by database.yml at boot: 5 by default, or RAILS_MAX_THREADS if the file reads that variable, which this one evidently did not. Now 25 threads are trying to share 5 connections; 20 of them wait on the pool semaphore, holding requests open and consuming dyno memory while doing nothing useful.
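
The failure mode is easy to reproduce outside production. A minimal sketch, assuming the pg gem and a local Postgres (the connection details are placeholders): 25 threads contend for a pool of 5, each holding its connection the way a slow query would.

# pool_exhaustion_demo.rb — illustrative, not production code
require "active_record"

ActiveRecord::Base.establish_connection(
  adapter: "postgresql",
  database: "app_development",  # placeholder database name
  pool: 5,                      # the pool nobody raised
  checkout_timeout: 5           # seconds to wait before raising
)

threads = Array.new(25) do
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do |conn|
      conn.execute("SELECT pg_sleep(10)")  # hold the connection like a slow query
    end
  rescue ActiveRecord::ConnectionTimeoutError => e
    puts e.message  # 20 of the 25 threads end up here
  end
end
threads.each(&:join)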

The deeper anti-pattern is tuning Puma config by copying values from someone else's blog post. The right values depend on three things specific to your app: how much memory each process uses, how I/O-bound a typical request is, and how many database connections the database can sustain. None of those are knowable in someone else's config.

The connection pool math

Two rules connect Puma settings to the database. Get these right and most Rails throughput problems disappear:

Rule 1: the connection pool size must equal (or slightly exceed) RAILS_MAX_THREADS per process.

# config/database.yml
default: &default
  adapter: postgresql
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>

# config/puma.rb
threads_count = ENV.fetch("RAILS_MAX_THREADS", 5).to_i
threads threads_count, threads_count

# Each process gets a pool of size RAILS_MAX_THREADS.
# Each thread in the process can claim one connection.
# No thread ever waits on the pool semaphore.

Rule 2: total connections across all dynos and all worker processes must fit within the database's max connections.

# Math for a 6-dyno app:
#   6 dynos × 2 workers × 5 threads = 60 simultaneous connections
#
# Postgres on Heroku Hobby tier: max 20 connections. Math fails.
# Postgres on Heroku Standard-0:  max 120 connections. Math passes.
#
# Sidekiq with concurrency 10 across 2 worker dynos adds 20 more.
# Total: 60 (web) + 20 (workers) = 80, with headroom on Standard-0.

# When you exceed max_connections, new connections fail with
# "FATAL: too many connections for role 'app'", which manifests
# in Rails as ActiveRecord::ConnectionNotEstablished during boot.

This is the math every Rails team should know by heart. Most production incidents I have seen labeled "scaling problem" are connection-pool exhaustion in disguise. The fix is usually a config change, not new hardware.
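
The arithmetic is simple enough to script, so it can live next to the config and be re-run whenever the fleet changes. A sketch, with all inputs as examples (substitute your own dyno counts and your plan's max_connections):

def connection_budget(web_dynos:, workers:, threads:,
                      sidekiq_dynos: 0, sidekiq_concurrency: 0,
                      max_connections:)
  web   = web_dynos * workers * threads
  jobs  = sidekiq_dynos * sidekiq_concurrency
  total = web + jobs
  { web: web, jobs: jobs, total: total, headroom: max_connections - total }
end

connection_budget(web_dynos: 6, workers: 2, threads: 5,
                  sidekiq_dynos: 2, sidekiq_concurrency: 10,
                  max_connections: 120)
# => {:web=>60, :jobs=>20, :total=>80, :headroom=>40}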

For very large fleets where the per-process connection cost becomes prohibitive (you cannot fit 60 dynos × 25 connections inside Postgres's max), the next move is PgBouncer, a connection pooler that multiplexes thousands of application connections onto a smaller number of real Postgres connections. PgBouncer is a real architectural addition; do not reach for it until the simpler math breaks.
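
If you do adopt PgBouncer in its common transaction-pooling mode, two Rails settings need to change, because prepared statements and advisory locks both assume a long-lived session that the pooler no longer guarantees. A sketch of the database.yml additions (your production block will differ):

# config/database.yml — additions for PgBouncer transaction pooling
production:
  <<: *default
  prepared_statements: false   # per-session prepared statements break behind the pooler
  advisory_locks: false        # Rails' migration advisory locks also need a stable session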

How to pick the right thread count

Within one Puma process, more threads means more concurrent requests. The right number depends on how much time each request spends waiting on I/O versus executing Ruby.

  • If 80% of request time is I/O (the typical Rails app: most time is spent in the database and external calls), the GVL is released for that 80%. You can run 5–10 threads per process without contention (a back-of-the-envelope formula follows this list).
  • If 50% of request time is Ruby (heavy ERB rendering, JSON serialization, JSON parsing, in-memory transformation), thread contention starts mattering. 3–5 threads per process is the sweet spot.
  • If you are CPU-bound (rare for Rails, usually computational endpoints like image processing), threads beyond the CPU count are wasted. Use processes instead.
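
A rough heuristic, not a rule, ties these ranges to a number: if a request holds the GVL for fraction f of its wall time, about 1/f threads keep one core busy before contention dominates.

def threads_per_process(ruby_fraction)
  (1.0 / ruby_fraction).round
end

threads_per_process(0.2)  # => 5, the 80%-I/O case above
threads_per_process(0.5)  # => 2, a conservative floor for render-heavy apps

Treat the result as a starting floor and load-test upward from it; the ranges in the list above add headroom for request-to-request variance.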

Your APM can tell you the I/O-to-Ruby ratio. Open one of your average endpoints in the flame chart. The colored segments are the breakdown. If the database and HTTP segments dominate, raise threads. If the Ruby segment dominates, raise processes (or look at YJIT, Ruby's just-in-time compiler, enableable at runtime since Ruby 3.3, which can speed up Ruby code by 20-40% on real Rails workloads).

# config/application.rb (Ruby 3.3+)
# Enable YJIT for production. Typical Rails throughput
# improvement: 15-30%, sometimes more, at minimal memory cost.
# (Rails 7.2+ ships an equivalent default; this is for earlier Rails.)
config.before_configuration do
  if defined?(RubyVM::YJIT) && Rails.env.production?
    RubyVM::YJIT.enable
  end
end

Processes vs threads, in plain terms

The senior shortcut for explaining this to a teammate:

  • Threads are cheap. They share memory with their parent process, so each new thread costs ~5MB plus its per-request allocations. A process running 10 threads uses only modestly more memory than one running 5.
  • Processes are expensive. Each Puma worker is a full fork of the master, with its own copy of the Rails app loaded in memory. Going from 2 to 4 workers roughly doubles the dyno's memory footprint.
  • Processes use multiple CPU cores. Threads inside one process share one core's worth of Ruby execution due to the GVL. If you have a 4-core dyno, you want at least 4 workers to use them all.
  • The good news for Rails: most Rails work is I/O-bound, so the GVL is released most of the time, and threads parallelize well even within one process.

The default Puma config on Heroku (2 workers × 5 threads = 10 concurrent requests on a 2x dyno) is a sensible starting point. Scale workers when you see Ruby segments dominating in the APM (or when you have CPU headroom and memory headroom). Scale threads when you see I/O segments dominating and you have free memory.
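
One way to encode the "at least one worker per core" guidance directly in config, as a sketch rather than a recommendation (Etc.nprocessors is in Ruby's standard library and reports the number of available cores; the fallback values are assumptions, not Heroku defaults):

# config/puma.rb — core-aware sizing sketch
require "etc"

workers ENV.fetch("WEB_CONCURRENCY") { Etc.nprocessors }.to_i
threads_count = ENV.fetch("RAILS_MAX_THREADS", 5).to_i
threads threads_count, threads_count

preload_app!  # load the app once in the master so workers fork with copy-on-write memory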

What real teams have written

Heroku's documentation on Puma sizing is the canonical Rails reference for this topic, and the math above mirrors theirs. The post titled "Concurrency and Database Connections in Ruby with ActiveRecord" (still findable on Heroku's blog) is the single best primer.

Shopify's engineering team has written multiple posts about running Rails at very high throughput, and the GVL's behavior is a recurring topic. Their advocacy for jemalloc as an alternative memory allocator for Ruby came from production memory pressure on heavily threaded Puma processes, a real-world artifact of the threading model. They also publish about YJIT performance on their workloads, including measurable throughput gains in production.

GitHub's engineering blog documented their migration from Unicorn (many workers, one thread each) to Puma (many threads per worker) at considerable scale. The post is instructive: most of the work was not the config change; it was finding and fixing the thread-unsafe code that had accumulated over years of running on a single-threaded server. The lesson: a Rails app that is "thread-safe enough" on Unicorn is not necessarily ready for Puma without an audit.

Common thread-safety landmines

If you are moving to multi-threaded Puma after running single-threaded, these are the places to audit:

  • Module-level mutable state. A writable class-level attribute like MyService.cache is a shared mutable object; multiple threads writing to it race. Use Concurrent::Map from the concurrent-ruby gem, or per-thread storage via Thread.current (see the sketch after this list).
  • Memoization in class-level state. @@expensive ||= compute on a class is shared across threads, and ||= is not atomic: it is a read, a test, and a write. Use a Mutex or move the memoization to per-request scope.
  • Gems with global state. Some older gems hold state in module variables; check release notes for "thread-safe in version X." Recent versions of major gems are thread-safe; legacy ones may not be.
  • External SDK clients that are not re-entrant. Some older payment-processor SDKs use module-level configuration; setting them per-request from a thread is unsafe. Use thread-local instances.
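
A minimal illustration of the first landmine and one fix, using hypothetical names (MyService and its cache are not from any real codebase):

require "concurrent"  # the concurrent-ruby gem

class MyService
  # Unsafe version: a plain Hash written from many threads can lose
  # or interleave writes.
  #   @cache = {}
  @cache = Concurrent::Map.new  # safe for concurrent reads and writes

  def self.fetch(key)
    # compute_if_absent runs the block at most once per key, atomically
    @cache.compute_if_absent(key) { expensive_lookup(key) }
  end

  def self.expensive_lookup(key)
    sleep 0.01  # stand-in for real work
    key.to_s.upcase
  end
end

Array.new(16) { Thread.new { MyService.fetch(:plan) } }.each(&:join)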

The one Rails-thread-safety rule that catches 80% of bugs: if you ever use @@variable or write to a module-level mutable object, audit it. Rails itself is thread-safe; the bugs almost always live in application code that did not anticipate multiple threads in the same process.

The principle at play

The GVL is a real constraint, but it operates on a smaller surface than developers usually assume. It serializes only Ruby code execution, and Ruby code is a small fraction of the time a typical Rails request spends running. The bigger fraction (waiting on the database, the HTTP response, the disk) happens with the GVL released, fully concurrent across threads.

The deeper move is recognizing that throughput in Rails comes from doing the math, not from copying configs. The right thread count depends on your app's I/O profile. The right pool size depends on your thread count. The right worker count depends on your CPU and memory. Three numbers, all tied together, all measurable from your APM and your database stats. The senior skill is not "knowing the right Puma config"; it is knowing the relationship between the numbers and adjusting them based on what you measure.

The pragmatic value: most Rails apps are underutilized in their concurrency settings, paying for dynos that sit idle 80% of the time waiting on I/O. Tuning threads correctly can reduce dyno count substantially. It is one of the few scaling moves that costs nothing and saves money immediately.

Practice exercise

  1. Open your app's config/puma.rb and config/database.yml. Check that the pool: setting in database.yml equals or exceeds RAILS_MAX_THREADS. If it does not, fix it.
  2. Calculate: total_web_connections = dyno_count × workers × threads. Add Sidekiq concurrency × worker dyno count. Compare to your database's max_connections. Note the headroom (or the lack of it).
  3. In your APM, open the flame chart for your average endpoint. Note the breakdown by segment: SQL %, external HTTP %, view rendering %, Ruby %. The first two tell you how much I/O parallelism you have available.
  4. If you are on Ruby 3.3+ and not running YJIT, enable it and measure throughput before and after. The change is usually visible in p50 latency within minutes.
  5. Bonus: run SELECT count(*), state FROM pg_stat_activity GROUP BY state; against your production database during peak traffic. If you see many connections in the "idle in transaction" state, something in your app is holding a connection while doing non-database work, usually a slow Ruby block or an external HTTP call inside a transaction.
  6. Bonus 2: read Heroku's "Concurrency and Database Connections in Ruby with ActiveRecord" post if you have not. The math in this lesson is theirs; the post is one of the few canonical references on the topic.