Scaling Series · 1 of 8
Diagnose Before You Scale
Where "we need to scale!" goes wrong, why the bottleneck is almost never what an intermediate developer thinks it is, and the tools senior engineers actually open first.
Where this rule comes from
In 1974, Donald Knuth wrote the line every engineer eventually quotes: "Premature optimization is the root of all evil." The full sentence matters more than the famous truncation: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." The 3% is real. The trouble is finding it.
Brendan Gregg, the performance engineer behind Netflix's observability work, formalized the discipline into a single rule in 2012: the USE Method. For every resource (CPU, memory, network, disk, database connections), check three things: Utilization, Saturation, Errors. The point of the method is not the acronym; it is that you check resources before you change code. If you do not know which resource is saturated, no code change you make will reliably help.
The Rails community arrived at the same rule through painful experience. Most "we need to scale!" pull requests in Rails history have one common shape: somebody guessed, somebody added infrastructure, the symptom moved. Adding Redis without measuring. Sharding a database that was fine. Switching to microservices to "scale the team" while the actual bottleneck was a missing index.
The first rule of senior scaling work, then, is: do not start. Do not change architecture, do not add infrastructure, do not refactor for performance until you know exactly which resource is saturated, on which request, at which percentile, on which path. What separates intermediates from seniors here is mostly not doing things until the diagnosis is in.
The anti-pattern
Picture a Rails team three years into product-market fit. Traffic has grown 10x in eighteen months. Pages feel slower. Sentry has more timeout reports than it used to. The team Slack lights up with "we need to scale." Within a week, someone has proposed:
- "Let's add Redis caching everywhere."
- "Let's move the heavy stuff to Sidekiq."
- "Let's switch to PostgreSQL connection pooling with PgBouncer."
- "Let's evaluate sharding the orders table."
- "Let's extract the billing service into its own monolith."
Each of these is a real technique. Each of them might be the right answer. None of them can be identified as the right answer without diagnostics. The team that adopts any of them blindly has roughly a one-in-five chance of helping the symptom, and a four-in-five chance of moving the problem somewhere else, often somewhere worse.
The deeper anti-pattern is that "scaling" is being used as a single word for at least three completely different problems:
- Latency: one request is slow. Fix: faster queries, better caching, fewer round trips.
- Throughput: the server cannot handle the number of concurrent requests. Fix: more concurrency, better connection pooling, more dynos.
- Data growth: a query that used to be fast is no longer fast because the table grew. Fix: indexes, pagination, partitioning, archival.
A senior knows which one they have before they reach for tools. "We need to scale" is the question, not the answer. The answer is one of these three, with a specific resource named.
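One cheap signal for telling the first two apart is request queue time: if requests wait for a free worker before your app even sees them, you have a throughput problem; if the app picks them up immediately and is still slow, you have a latency problem. A sketch of a middleware that logs queue time from the X-Request-Start header some load balancers set; the header name and format vary by platform (Heroku sends milliseconds since epoch, nginx has to be configured to send it at all), so treat this as an illustration, not a drop-in:
# config/initializers/queue_time_logger.rb (illustrative sketch)
class QueueTimeLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    # Heroku-style value: milliseconds since epoch, sometimes "t="-prefixed.
    if (raw = env["HTTP_X_REQUEST_START"])
      started_ms = raw.delete_prefix("t=").to_f
      queue_ms = (Time.now.to_f * 1000) - started_ms
      Rails.logger.info("queue_time_ms=#{queue_ms.round(1)}")
    end
    @app.call(env)
  end
end

Rails.application.config.middleware.insert_before 0, QueueTimeLogger
High queue time with normal in-app response time points at throughput; low queue time with slow responses points at latency.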
The senior diagnostic flow
Three layers of tooling, used in this order:
1. The APM dashboard. Whatever your team pays for, Datadog, New Relic, Skylight, Honeycomb, AppSignal, Scout, Sentry's transaction view. Every one of these answers the same first question in roughly thirty seconds: which endpoints are slow at p95, and where are they spending their time?
The APM dashboard breaks every request into a flame chart of segments: time spent in the database, time spent in external HTTP calls, time spent in view rendering, time spent in Ruby itself. One look at the chart tells you which resource is saturated. If 90% of the time is in the database, you have a query problem. If 80% is in an external HTTP call, you have a synchronous-IO problem. If the chart is mostly Ruby with no segments, you have a CPU or allocation problem.
If you do not have an APM tool, install one before you do anything else. The biggest factor in how fast a Rails team can diagnose performance issues is whether they can see flame charts of real production traffic. Skylight has a free tier for small apps; Honeycomb has free seats; Sentry's transaction view is included with most plans. The cost of running blind is much higher than the cost of any of these.
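The install really is one gem plus one setup command. A minimal sketch, assuming Skylight; the token and exact steps come from its onboarding flow:
# Gemfile
gem "skylight"

# One-time setup with the token from Skylight's onboarding screen
# (writes config/skylight.yml):
#   bundle exec skylight setup <setup-token>
# Deploy, send production traffic, and flame charts appear shortly after.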
2. pg_stat_statements. If the APM points at the database, the next stop is the Postgres extension that records timing statistics for every query the database has run. One query against it tells you which SQL statements are eating the most cumulative database time across all requests:
-- in psql, against your production replica (read-only is fine)
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- the column to watch is total_exec_time:
-- a query called 1,000,000 times at 2ms each costs more than
-- a query called 100 times at 200ms each.
The top ten rows in that result are almost always your scaling backlog. Each one is either missing an index, doing an N+1, returning too many rows, or running on a request path where it should not be running at all.
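If that SELECT errors because the view does not exist, the extension is not enabled yet. Two steps, sketched below: the module must be in shared_preload_libraries (a server restart; on managed Postgres like Heroku or RDS it usually already is), and the extension has to be created once per database, which a Rails migration can do:
# postgresql.conf (requires a server restart; often preset on managed hosts):
#   shared_preload_libraries = 'pg_stat_statements'

# One-line Rails migration to create the extension in this database:
class EnablePgStatStatements < ActiveRecord::Migration[7.1]
  def change
    enable_extension "pg_stat_statements"
  end
end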
3. rack-mini-profiler in development. When you have isolated a specific endpoint, add rack-mini-profiler to your Gemfile in development. It puts a little speed badge in the top-left of every page, showing per-request SQL count, total SQL time, Ruby time, and view rendering time. It catches N+1s before they ship, because most N+1s are obvious from the SQL count: 200 queries on a page that should run 4.
# Gemfile (development group)
group :development do
  gem "rack-mini-profiler"
  gem "memory_profiler" # optional: deeper allocation profiles
  gem "stackprof"       # optional: flame graphs of CPU time
end

# config/initializers/mini_profiler.rb (development only)
if Rails.env.development?
  Rack::MiniProfiler.config.position = "top-left"
  Rack::MiniProfiler.config.start_hidden = false
end
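The two optional gems plug into the same badge through query-string switches; these are rack-mini-profiler features that activate when the matching gem is in the bundle:
# With stackprof in the bundle, append to any URL:
#   ?pp=flamegraph        # CPU flame graph of that one request
# With memory_profiler in the bundle:
#   ?pp=profile-memory    # per-request object allocation report
# ?pp=help lists every other switch the badge supports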
These three tools, in this order, will diagnose 90% of Rails performance issues. The remaining 10% need flame graphs (stackprof), allocation profiles (memory_profiler), or heap dumps, which are the next layer down. For an intermediate developer leveling up to senior, the three above are the discipline to learn first.
Where the bottleneck actually lives (in order of likelihood)
After diagnosing dozens of production Rails apps, performance engineers converge on the same prioritized list. The bottleneck is almost always one of these six, in roughly this frequency order (the fixes are sketched in code after the list):
- A single missing index. One specific query, run thousands of times per minute, doing a sequential scan on a table that grew past 100k rows. The fix is a one-line migration that adds an index. This is the single most common scaling problem in Rails apps.
- An N+1 query. A page that lists 50 records and runs 51 queries (one for the list, 50 for the associations). The fix is .includes(:association) or its variants. Bullet detects most of these in development.
- A synchronous external HTTP call. A controller action that calls Stripe / Mailchimp / Slack in-line, where the request waits for the round-trip. Fix: move to a background job, or pre-compute the result.
- Memory bloat from loading too many records. A report that calls Order.all.each and loads 500,000 rows into Ruby objects. Fix: find_each with a sensible batch size, or aggregate in SQL.
- Hot-row contention. Multiple processes updating the same counter cache or the same row at the same time, with last-writer-wins semantics or unnecessary serialization. Fix: atomic SQL increments, or denormalized counter strategies.
- Connection pool exhaustion. The Puma threads or Sidekiq workers outnumber the available database connections, and requests queue up waiting for one. Fix: tune the connection pool size or scale Puma/Sidekiq differently.
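As promised, a sketch of what fixes two through six look like in code. Order, User, and PaymentSyncJob are hypothetical stand-ins, not names from any particular app; fix one, the index migration, is sketched in the "When you can skip the measurement" section below:
# 2. N+1: load the association alongside the list; 51 queries become 2.
orders = Order.includes(:user).where(status: "paid").limit(50)

# 3. Synchronous external call: hand the round-trip to a background job
#    (PaymentSyncJob is a hypothetical ActiveJob subclass; assume
#    `order` is in scope, e.g. set by a controller).
PaymentSyncJob.perform_later(order.id) # instead of calling Stripe in-line

# 4. Memory bloat: stream in batches instead of Order.all.each.
Order.find_each(batch_size: 1_000) do |order|
  # process one row at a time; memory stays flat
end

# 5. Hot-row contention: an atomic SQL increment instead of
#    read-modify-write in Ruby.
Order.increment_counter(:views_count, order.id)

# 6. Pool exhaustion: in config/database.yml, keep the pool at least
#    as large as the number of threads that will borrow from it:
#      pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>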
Notice what is not on this list: Redis. Sharding. Microservices. Read replicas. Those are real techniques, but they belong further down, typically only after the six items above have been ruled out. The senior heuristic: if a query plan or an index does not fix it, ask whether you are looking at the right bottleneck before reaching for infrastructure.
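Checking the query plan is cheap enough to do from a console before reaching for anything heavier. A sketch, again with a hypothetical Order model:
# Rails console: show the planner's strategy for the suspect query.
puts Order.where(user_id: 42).explain
# "Seq Scan on orders" on a large table is the missing-index signature;
# "Index Scan using index_orders_on_user_id" means the index is used.
# Rails 7.1+ can pass options through so Postgres actually executes it:
#   puts Order.where(user_id: 42).explain(:analyze)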
What real teams have written publicly
The Rails engineering blogs at scale share a common pattern: almost every public "how we scaled" post starts with a diagnostic discovery, not an architectural decision.
- Shopify Engineering publishes regularly on Rails performance at the scale of millions of requests per minute. Their posts on identifying slow queries with pg_stat_statements, on Ruby memory profiling with jemalloc, and on the IdentityCache pattern (which is a hot-row solution) all open with measurement, not architecture.
- GitHub Engineering wrote extensively about their migration to Vitess for database sharding, a famously large architectural move. The most important sentence in those posts is that they ran on a single MySQL primary until they had measured exactly why a single primary was no longer enough. The architecture was the last resort, not the first one.
- Artsy Engineering has a long-running blog with multiple posts on Rails N+1 fixes, slow query diagnosis, and migrating individual endpoints to GraphQL for selective optimization. Their posts on finding the slow code consistently include screenshots of New Relic or Skylight, measurement first, fix second.
The pattern across all three is the same. The teams that ship reliable scaling work measure first, fix the specific thing the measurement pointed at, ship, and measure again. The teams that try to "scale" without measuring usually generate a year of architecture work that does not move the needle.
The moving bottleneck
One more thing worth understanding before any of the other lessons in this series make sense. Scaling is not a state your app reaches. It is a series of bottlenecks, each one revealed when the previous one is removed.
A typical Rails app's lifetime of scaling work looks like:
- Year 1: a missing index. Add the index, latency drops 10x.
- Year 1.5: an N+1 on the dashboard. Add .includes, latency drops 4x.
- Year 2: a synchronous Stripe call in a critical request. Move to a background job, p99 drops.
- Year 2.5: connection pool exhaustion at 11 AM EST (newsletter spike). Tune the pool, traffic drains.
- Year 3: hot-row contention on a counter cache. Switch to atomic SQL increments, lock waits disappear.
- Year 3.5: a memory leak in a specific background job. Find it with memory_profiler; restarting workers is no longer needed.
- Year 4: the bottleneck is now read traffic on the primary. Introduce a read replica (a configuration sketch follows this list).
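For completeness, the year-4 fix is the one piece of infrastructure Rails supports natively (6.0+). A minimal sketch with hypothetical database names; the YAML is shown in comments:
# config/database.yml (sketch):
#   production:
#     primary:
#       <<: *default
#     primary_replica:
#       <<: *default
#       replica: true

# app/models/application_record.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Route writes to the primary and reads to the replica when asked.
  connects_to database: { writing: :primary, reading: :primary_replica }
end
Automatic per-request read/write routing also exists via config.active_record.database_selector; that belongs to a later lesson.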
Each one of those is a different fix, found by measuring the symptom at the time. A team that had jumped to "introduce a read replica" in year 1 would have wasted twelve months and still have had the underlying missing-index problem. The order matters. The order is dictated by what is actually saturated, not by what you read about in the latest engineering blog post.
The senior heuristic: the bottleneck you are trying to fix today is not the bottleneck you will be fixing in a year. Build the muscle of diagnosis, not the muscle of any specific fix. The diagnostic skill transfers across every layer; the fix-specific knowledge has a much shorter half-life.
When you can skip the measurement
Two cases where "measure first" is overkill:
- Adding an index when you wrote the slow query yourself. If a senior reviewer reads your PR and says "you need an index on orders.user_id" because they read the SQL, the measurement is the SQL itself. Ship the index (a migration sketch follows this list).
- Obvious correctness fixes that happen to be fast. Replacing Order.all.each with find_each does not need a benchmark; the second form is unambiguously better at any scale where it matters.
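The sketch promised above, using the table and column from the reviewer's comment, with the flag that keeps the build from locking writes on a busy Postgres table:
# db/migrate/xxxx_add_index_to_orders_on_user_id.rb
class AddIndexToOrdersOnUserId < ActiveRecord::Migration[7.1]
  # Concurrent index builds cannot run inside a transaction.
  disable_ddl_transaction!

  def change
    add_index :orders, :user_id, algorithm: :concurrently
  end
end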
The rule of thumb: measurement is mandatory before infrastructure changes, before architectural changes, before introducing new dependencies. For one-line correctness fixes that are also faster, you do not need a benchmark to justify them.
The principle at play
Engineering has a famous bias: action is more rewarding than inaction. Adding caching feels like progress. Sharding feels like progress. Discovering that the actual problem was a missing index feels almost embarrassing: there was nothing to architect, nothing to talk about at the next meetup. That asymmetry is where most wasted scaling work comes from.
The deeper move is one Brendan Gregg has been writing about for a decade: your intuition about where the bottleneck is, in a system with millions of moving parts, is almost always wrong. The intuition was formed in apps of a different size, with a different shape, on a different stack. The system you are diagnosing today does not behave the way the system in your memory did. Only measurement is reliable; intuition is a starting hypothesis at best.
The pragmatic value of this lesson, the one that will save your team months of work over a year: when someone says "we need to scale," respond with "where is the bottleneck?" before agreeing to anything. If they cannot answer with a specific endpoint, a specific resource, a specific percentile, then the next step is to measure, not to plan. That single conversational reflex is one of the most reliable markers of a senior engineer in a Rails shop.
Practice exercise
- Pick your app's slowest endpoint at p95. If you do not know which one it is, you have identified the most important thing to install: an APM. Skylight, Honeycomb, Sentry, or AppSignal all have free or cheap tiers.
- Once you know the endpoint, open its flame chart in the APM. Look at the segment breakdown. Write down what percentage of the time is in: SQL, external HTTP, view rendering, Ruby itself. Those four numbers tell you which lesson in this series applies to your situation.
- Run SELECT query, calls, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10; against your production database (a read replica is fine). Read the top 10 queries. Each one is a candidate for a scaling lesson.
- Add rack-mini-profiler to your development Gemfile. Browse your own app for fifteen minutes. Note every page where the SQL count is above 20 or the total time is above 200ms. Those are the pages with N+1s, missing indexes, or other fixable issues.
- Bonus: read one post from the Shopify, GitHub, or Artsy engineering blogs about a real scaling diagnosis. Pay attention to how much of the post is measurement, and how much is fix. In real engineering work, the ratio is usually 70/30 in favor of measurement.
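A companion trick for the pg_stat_statements exercise: after shipping a fix, zero the counters so the next top-ten list reflects only post-fix traffic. A sketch using the extension's built-in reset function (requires sufficient database privileges):
# From a console or one-off task, after deploying the fix:
ActiveRecord::Base.connection.execute("SELECT pg_stat_statements_reset()")
# Re-run the top-ten query a day later and compare.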