← Back to Course

Scaling Series · 2 of 8

Indexes & EXPLAIN

Where database indexes come from, why a single missing index is the most common Rails scaling problem, and how to read EXPLAIN well enough to fix it.

Where this rule comes from

Database indexes are the oldest scaling technique in computing. The B-tree, the data structure behind most relational-database indexes, was invented at Boeing in 1971 by Rudolf Bayer and Edward McCreight. The point of a B-tree is to let you find any row in a billion-row table with a few page reads: binary search over a billion keys would take about 30 comparisons, and a B-tree's wide pages collapse those into a tree only three or four levels deep. Indexes have been the highest-impact performance tool in databases ever since.
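The height arithmetic behind that claim fits in a few lines of Ruby. The fanout figure here is an assumption for illustration (real per-page key counts depend on key width; Postgres index pages are 8 KB and typically hold a few hundred keys):

```ruby
# Height of a B-tree: how many page reads to find one row among `rows`
# rows, if each index page fans out to `fanout` children.
def btree_height(rows, fanout)
  (Math.log(rows) / Math.log(fanout)).ceil
end

puts btree_height(1_000_000_000, 500)  # a billion rows, ~500 keys per page: 4 levels
puts btree_height(1_000_000_000, 2)    # a binary tree, for contrast: 30 levels
```

The wide fanout is the whole trick: the same billion rows that would take 30 binary-search steps sit under a tree only four pages deep.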

EXPLAIN arrived next. Every serious database has a command that shows the planner's decisions for a query: which indexes it considered, which one it chose, how many rows it expects to scan, what the cost estimate is. Postgres's version is documented from the early 1990s. Reading EXPLAIN output is the single skill that turns a Rails developer into a database-aware Rails developer.

The reason this matters specifically for Rails: ActiveRecord hides the SQL until something breaks. A junior writes User.where(email: params[:email]).first, ships it, and never thinks about the index. The query runs fast on 100 rows. At 500,000 rows, the same query starts taking 800ms because the database is reading the whole table. The Rails code did not change; the underlying database performance did, because the data shape did.

The senior move: when a query slows down, ask the database why before changing anything else. Indexes are physics: they let the database do less work. Every other scaling technique in this series is downstream of "the database is doing the right amount of work for the question you asked."

The anti-pattern

Picture a Rails app that has been adding features for two years. The team built a "find user by email" lookup, used in sign-in. Sign-in was fast for the first year. At 800,000 users, sign-in starts taking 600ms at p95. Sentry alerts. The team's first instinct is "we need to cache the user lookup in Redis."

# app/controllers/sessions_controller.rb
class SessionsController < ApplicationController
  def create
    user = User.find_by(email: params[:email])  # 600ms at p95
    if user&.authenticate(params[:password])
      session[:user_id] = user.id
      redirect_to root_path
    else
      render :new
    end
  end
end

# db/schema.rb
create_table "users" do |t|
  t.string :email     # ...without an index
  t.string :password_digest
  t.timestamps
end

The cache idea is wrong, and the diagnosis takes thirty seconds. Open psql and run:

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com' LIMIT 1;

-- output:
-- Seq Scan on users  (cost=0.00..18395.00 rows=1 width=92)
--                    (actual time=312.4..612.8 rows=1 loops=1)
--   Filter: ((email)::text = 'alice@example.com')
--   Rows Removed by Filter: 799999
-- Planning Time: 0.184 ms
-- Execution Time: 612.819 ms

The bug is in two words: "Seq Scan." Postgres has no index on email, so to find one user it reads every row in the table. 800,000 rows examined to find one. 612ms of wasted database time, on a request that runs thousands of times a day.
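The work ratio behind "Seq Scan" is easy to model in plain Ruby. This is a sketch with hypothetical data, and a Hash stands in for the real B-tree, but the point survives the simplification: a sequential scan examines every row, an index examines almost none:

```ruby
# A table as an array of rows, and a sequential scan that counts its work.
rows = 800_000.times.map { |i| { id: i, email: "user#{i}@example.com" } }

examined = 0
found = rows.find { |r| examined += 1; r[:email] == "user799999@example.com" }
puts examined  # 800_000 rows examined to find one

# An "index": precompute email -> row once, then each lookup skips the scan.
index = rows.to_h { |r| [r[:email], r] }
puts index["user799999@example.com"][:id]  # one hash lookup, no scan
```

The database's version of the precomputation is the index build; every query after that pays the small lookup cost instead of the full scan.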

The fix is a one-line migration:

class AddIndexOnUserEmail < ActiveRecord::Migration[8.0]
  def change
    add_index :users, :email, unique: true
  end
end

After the migration, the same EXPLAIN looks like this:

-- Index Scan using index_users_on_email on users
--   (cost=0.42..8.44 rows=1 width=92)
--   (actual time=0.034..0.036 rows=1 loops=1)
--   Index Cond: ((email)::text = 'alice@example.com')
-- Planning Time: 0.205 ms
-- Execution Time: 0.058 ms

612ms became 0.058ms. A 10,000x improvement, from a one-line change. No Redis. No cache invalidation logic. No new infrastructure. Only the database doing the right amount of work.

How to read EXPLAIN

EXPLAIN output looks intimidating until you know the four things to read. The senior shortcut: look for these four words first:

  • Seq Scan: the database is reading the whole table. For tables over a few thousand rows, this is the canonical "you need an index" signal.
  • Index Scan: the database is using an index. Good. The query is reading only the rows that match.
  • Index Only Scan: the database is answering the query entirely from the index, without touching the underlying table at all. This is the fastest shape and worth designing for when the same query is hot.
  • Bitmap Heap Scan: the database used an index (you will see a Bitmap Index Scan node beneath it) to collect the matching row locations, then fetched those rows from the table in batches. Fine; usually fast.

Then check the numbers. Three matter:

  • Rows. Compare "estimated rows" to "actual rows" (with EXPLAIN ANALYZE). If the planner thought it would scan 100 rows but actually scanned 800,000, the table statistics are stale. Run ANALYZE users; and re-EXPLAIN.
  • Execution Time. If it is more than a few milliseconds for a single-row lookup, something is wrong.
  • Rows Removed by Filter. This number tells you how many rows the database read but threw away. If it is large, your WHERE clause is not selective enough on the index, or there is no usable index for it.

In Rails console, you can run EXPLAIN on any ActiveRecord relation. This is the everyday tool for "is this query going to scale?":

puts User.where(email: "alice@example.com").explain

# or, on Rails 7.1+, for the actual execution plan with real timings:
puts User.where(email: "alice@example.com").explain(:analyze)

# to refresh a table's planner statistics, run ANALYZE through the connection:
#   ActiveRecord::Base.connection.execute("ANALYZE users")

Composite, partial, and covering indexes

Single-column indexes solve most problems. Three other index shapes solve the remaining ones, and intermediate Rails developers often do not know they exist.

Composite indexes cover multiple columns. For a query that filters on two columns together, a composite index is much faster than two separate single-column indexes:

# Slower than it needs to be, even with separate single-column indexes on user_id and status:
Order.where(user_id: u.id, status: "paid").order(created_at: :desc)

# Fast with:
add_index :orders, [:user_id, :status, :created_at]

The order of columns in a composite index matters. The index is sorted by its leading column first, so it can only serve queries that constrain a leftmost prefix: an index on (user_id, status, created_at) can answer queries filtering on user_id alone, on user_id + status, or on all three, but not a query filtering only on status.
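The leftmost-prefix rule falls out of how the index is sorted. A composite index is essentially the rows ordered by (user_id, status, created_at); a plain-Ruby sketch with toy, hypothetical values shows why a leading prefix is searchable and a middle column is not:

```ruby
# A composite index is the rows sorted by [user_id, status, created_at].
entries = [
  [1, "paid",    "2024-01-03"],
  [1, "pending", "2024-01-01"],
  [2, "paid",    "2024-01-02"],
  [2, "paid",    "2024-01-05"],
  [3, "pending", "2024-01-04"],
].sort

# Filtering on the leading column: matches form one contiguous run,
# so the database can binary-search straight to them (an Index Scan).
user_2 = entries.select { |user_id, _, _| user_id == 2 }
puts user_2.inspect  # positions 2 and 3, side by side

# Filtering only on the middle column: matches are scattered through
# the sort order, so the index cannot narrow the search (a Seq Scan).
pending = entries.each_index.select { |i| entries[i][1] == "pending" }
puts pending.inspect  # non-contiguous positions
```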

Partial indexes cover only the rows that match a condition. They are smaller, faster to maintain, and ideal for queries that always filter on the same predicate:

# This index only contains rows where status = 'pending'.
# Queries that filter on status='pending' use it; others don't see it.
add_index :orders, :user_id, where: "status = 'pending'", name: "idx_pending_orders_on_user"

# Useful when 99% of orders are 'paid' but you frequently query
# for the 1% that are 'pending': the index is tiny and very fast.
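The size argument is concrete: a partial index stores only the rows matching its predicate, which is exactly what keeps it small and cheap to maintain. A plain-Ruby sketch with made-up proportions:

```ruby
# 99% 'paid', 1% 'pending': a full index carries every order,
# a partial index carries only the pending ones.
statuses = Array.new(100_000) { |i| i % 100 == 0 ? "pending" : "paid" }
orders = statuses.each_with_index.map do |status, id|
  { id: id, user_id: id % 500, status: status }
end

full_index    = orders.map { |o| [o[:user_id], o[:id]] }
partial_index = orders.select { |o| o[:status] == "pending" }
                      .map { |o| [o[:user_id], o[:id]] }

puts full_index.size     # 100000 entries
puts partial_index.size  # 1000 entries, 1% of the size
```

Every write to a 'paid' order skips the partial index entirely, so the maintenance cost shrinks along with the size.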

Covering indexes (built with INCLUDE columns in Postgres 11 and later) let an Index Only Scan return columns that are not part of the indexed key, by storing them in the index leaf pages:

add_index :orders, :user_id, include: [:status, :total_cents]

# Now Order.where(user_id: u.id).select(:status, :total_cents)
# can run entirely from the index, never touching the orders table.

Covering indexes are an advanced move worth knowing about. They are not free: every column in the include list makes the index bigger and updates slower. But for read-heavy hot paths, they can collapse a query from milliseconds to microseconds.

How to find missing indexes in production

Two queries every Rails engineer should know. The first finds missing indexes on foreign keys:

-- Find foreign-key columns without indexes.
-- Almost every match here is a real bug.
SELECT c.conrelid::regclass AS table_name,
       a.attname            AS column_name,
       c.conname            AS fk_constraint
FROM pg_constraint c
JOIN pg_attribute a
  ON a.attrelid = c.conrelid
 AND a.attnum   = ANY(c.conkey)
WHERE c.contype = 'f'
  AND NOT EXISTS (
    SELECT 1 FROM pg_index i
    WHERE i.indrelid = c.conrelid
      AND a.attnum   = ANY(i.indkey)
  );

Every row this returns is a foreign-key column with no index. Postgres, unlike MySQL, does not index foreign-key columns automatically, and Rails only adds the index when the column was created through t.references; foreign keys added by hand with add_column, or created with index: false, slip through. Every missing index here turns "find this user's orders" into a sequential scan on the orders table.

The second query, pg_stat_statements from the previous lesson, ordered by total time, tells you which queries are using the most database time across all requests. The top entries are almost always either missing an index, doing an N+1, or returning too many rows.

The combination of these two queries is the senior's "scaling backlog" generator. Run them quarterly. Every match is a real, named, fixable issue. Most of them are one-line migrations.

What real teams have written

Shopify Engineering has published extensively about index strategy at scale. Their post on identifying problematic queries using pg_stat_statements mirrors the diagnostic flow above. For the operational discipline behind adding indexes online (without blocking writes) on tables with billions of rows, two open-source gems are worth reading: lhm (the Large Hadron Migrator, originally from SoundCloud and long maintained by Shopify for its MySQL fleet) and the community's strong_migrations.

GitHub Engineering's public writing on MySQL at scale repeatedly returns to the same point: when a query slows down, the answer is almost always either a missing index or a query plan that flipped because table statistics drifted. The architectural moves (Vitess, sharding) came much later, after every reasonable single-database move had been made.

Artsy Engineering's posts on Rails performance often open with EXPLAIN output and end with a one-line index addition. The pattern is so common that it is almost the genre: "we had a slow page, we ran EXPLAIN, the plan showed a Seq Scan, we added an index."

Discourse's Rails monolith has hundreds of indexes in db/schema.rb, many of them composite. Reading the schema is instructive: almost every multi-column WHERE clause in the codebase has a matching composite index, designed with the leading-column rule in mind. That discipline is what allows Discourse to host single-instance forums with millions of posts on modest hardware.

When NOT to add an index

Indexes are not free. Each one has costs:

  • Writes get slower. Every INSERT, UPDATE, or DELETE has to update the index. A table with 12 indexes pays 12 index updates per write.
  • Disk space. A composite index on three columns can be larger than the table itself on a wide row.
  • Maintenance. Adding an index on a 100M-row table on a busy database without CONCURRENTLY locks writes for the duration of the build, which can be hours.

The rule of thumb: add an index when there is a real query that needs it, with measured cost, and a plausible read-to-write ratio that favors the index. Do not add indexes "in case." Production database administrators have horror stories about adding too many.
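The write-amplification cost in the first bullet can be made concrete. Each index is one more structure that every INSERT must touch; a plain-Ruby sketch (the counts are the point, not the toy data structure):

```ruby
# A table with 12 indexes: every row insert triggers 12 index updates.
indexes = Array.new(12) { {} }

writes = 0
insert_row = lambda do |id|
  indexes.each { |idx| idx[id] = true; writes += 1 }
end

1_000.times { |i| insert_row.call(i) }
puts writes  # 12_000 index updates for 1_000 row inserts
```

The read side has to earn that 12x write overhead, which is why "add it in case" is the wrong default.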

Two Rails-specific operational rules: use algorithm: :concurrently when adding indexes to tables larger than ~100k rows in production (it lets the build happen without blocking writes; because a concurrent build cannot run inside a transaction, the migration class also needs disable_ddl_transaction!), and install the strong_migrations gem if your team has not already. It catches dangerous migrations in CI before they hit production.

The principle at play

An index is the database's way of pre-computing an answer to a question the application asks repeatedly. It trades disk space and slightly slower writes for dramatically faster reads on the indexed dimension. When the read-to-write ratio is high (as it is for most Rails apps), the trade pays for itself many times over.

The deeper move is recognizing that "the query is slow" almost always means "the database is doing more work than necessary." EXPLAIN tells you exactly what work it is doing. The diagnosis is rarely subtle once you can read the output. The senior skill is not knowing intricate index theory; it is having the discipline to open psql and run EXPLAIN before doing anything else.

The pragmatic value of this lesson: most "we need to scale!" panics resolve into one or two missing indexes. Adding them takes a few minutes. Not adding them costs you years of trying to outpace a problem that was never about architecture in the first place.

Practice exercise

  1. Run the "missing indexes on foreign keys" query above against your production database. For each row it returns, that is a missing index you can ship today.
  2. Pick your slowest pg_stat_statements entry. Open psql and run EXPLAIN ANALYZE on it. Look for Seq Scan in the output.
  3. If you see a Seq Scan on a table over ~10k rows, figure out which columns the WHERE clause needs an index on. Write the migration with algorithm: :concurrently.
  4. Look at your three slowest endpoints in your APM. For each, find the underlying SQL in the flame chart. Run EXPLAIN ANALYZE on each one. Note the execution time, and what kind of scan the plan is using.
  5. Bonus: read the db/schema.rb of an open-source Rails app you respect (Discourse, Forem, Mastodon). Count the indexes per table. Notice how many composite indexes there are on the highest-traffic tables. That density is what carries the app at scale.
  6. Bonus 2: install strong_migrations in your app. Run a destructive migration in development and watch it refuse to ship. The gem catches several years of operational mistakes that intermediate teams make.