Databases in Production · 2 of 8

The planner and its statistics

Why a query goes from 5ms to 800ms after a release without the SQL changing. What the planner knows about your data, how it goes stale, and the places seniors look first when a plan goes wrong.

Where this rule comes from

Postgres uses a cost-based planner. For every query, it enumerates candidate plans, estimates the cost of each, and picks the cheapest. The cost estimates depend on what the planner believes about the data: how many rows are in each table, how selective each predicate is, and how the values are distributed.

These beliefs live in two system catalogs. pg_class holds per-table summaries: reltuples (estimated row count), relpages (8KB pages on disk). pg_stats holds per-column distributions: most common values (MCVs), histograms, null fraction, average column width, number of distinct values.
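
Both catalogs are queryable, so the planner's current beliefs are easy to inspect. A minimal sketch; the orders table name is an assumption:

SELECT relname, reltuples::bigint AS estimated_rows, relpages
FROM pg_class
WHERE relname = 'orders';
-- reltuples is -1 if the table has never been vacuumed or analyzed (Postgres 14+)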

The statistics are sampled, not exact. Postgres reads a sample of rows (300 × default_statistics_target, which at the default target of 100 means 30,000 rows on a large table) and extrapolates. The sample is refreshed when autovacuum runs ANALYZE on the table, which by default happens after roughly 10% of the table's rows have changed.
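
Both knobs are visible and tunable. A sketch, assuming a hypothetical orders table with a hard-to-capture status column:

SHOW default_statistics_target;
-- 100 by default; ANALYZE samples 300 × target rows

-- Sample more rows for one skewed column, then refresh:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;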

The result: between autovacuum runs, the planner's view of your data can drift away from reality. A query that ran fine yesterday can run badly today because something changed about the data, not the query. Finding that drift is most of the senior's job during a slow-query investigation.

The anti-pattern: blame the SQL

A team gets paged at 9 a.m. The dashboard endpoint that used to return in 80ms is taking 4 seconds. The team opens the controller, looks at the query, and starts rewriting. They add an index. They try a different scope. They check for N+1 queries. Nothing helps.

Two hours in, someone runs EXPLAIN ANALYZE and sees the planner estimating 12 rows where 184,000 actually come back. The query is fine; the plan is wrong. Running ANALYZE accounts; takes 8 seconds, and the query is back to 80ms. The actual cause was a backfill job that ran overnight, inserting 200,000 new rows, all matching the dashboard's predicate, before autovacuum had a chance to refresh the stats.

The lesson is that the SQL was never the problem. The data underneath it changed in a way the planner did not know about. The senior reflex is "check the statistics freshness before you touch the query."

How statistics drive plan choice

The simplest example: a join. Postgres has three join algorithms, each best in a different regime.

  • Nested loop: for each outer row, look up matching inner rows. Cheap when the outer side returns very few rows. Expensive when it returns many.
  • Hash join: build an in-memory hash on the smaller side, probe with the larger. Cheap when both sides are reasonably sized. Memory-hungry if either side is huge.
  • Merge join: both sides arrive sorted, and the executor walks them in lockstep. Cheap when both sides are already ordered. Expensive when a Sort node has to feed it.

The planner picks among these based on the estimated row counts, above all the outer side's. If it expects 5 rows from the outer scan, it picks a nested loop. If it expects 50,000, it picks a hash join. If the estimate is 5 and the reality is 50,000, the nested loop runs 50,000 inner scans and the query is 10,000× slower than it could have been.
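
The mismatch is visible in EXPLAIN ANALYZE, which prints the estimated and actual row counts side by side on every node. A sketch, with hypothetical tables:

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.region = 'EMEA';
-- Compare "rows=" (the estimate) against "actual ... rows=" on the outer
-- scan node; a large gap there means the join algorithm was chosen on
-- bad information.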

Selectivity estimation is the other half. Given a predicate like WHERE status = 'pending', Postgres looks up the MCV list for the status column. If 'pending' is one of the most common values, the frequency comes straight from the sampled statistics. Otherwise it falls back to the histogram. For unusual values not in the MCV list, the estimate is the fraction of rows not covered by MCVs spread evenly across the remaining distinct values, a flat guess. This is where most selectivity bugs live.
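
You can read the same MCV list the planner reads. A sketch, assuming a hypothetical orders.status column:

SELECT most_common_vals, most_common_freqs, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
-- A value missing from most_common_vals gets the flat leftover estimate,
-- not a measured frequency.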

When statistics go stale

Stats go wrong in four common patterns.

1. Bulk inserts between autovacuum runs. A backfill job adds 500,000 rows in 30 seconds. Autovacuum has not run yet. The planner believes the table is its previous size, picks plans based on the old selectivity, and gets every estimate wrong.

2. Deletes that change distribution. A cleanup job removes all soft-deleted rows. The remaining rows have a different distribution of values (different MCVs, different histograms) but Postgres has not re-sampled yet. Queries that filtered on the deleted values now hit a different shape than the planner expects.

3. Migrations that backfill columns. Adding a new column and populating it via a background job creates a window where the column exists but Postgres has no histogram or MCV list for it. Equality predicates on the new column get the planner's default guess of 1 in 200 rows, which is almost always wrong.

4. Correlated columns the planner thinks are independent. By default, Postgres assumes columns are statistically independent. A predicate like WHERE country = 'France' AND city = 'Paris' gets estimated as (fraction of France rows) × (fraction of Paris rows). In reality, almost every Paris row is also a France row, so the true joint selectivity is the city selectivity alone; multiplying by the France fraction makes the estimate too low by exactly that factor. This is fixed with extended statistics, covered below.
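
For the first three patterns the remedy is the same: refresh statistics as the last step of the job that changed the data, rather than waiting for autovacuum. A minimal sketch, table and column names hypothetical:

-- Last step of the backfill, cleanup, or migration job:
ANALYZE orders;

-- Or restrict the refresh to the columns the job touched:
ANALYZE orders (status);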

Extended statistics for correlated columns

Postgres 10 added CREATE STATISTICS for capturing correlations between columns (the multi-column MCV kind requires Postgres 12 or later). Three kinds matter for Rails apps:

-- Functional dependencies: city → country, postcode → city
CREATE STATISTICS users_geo (dependencies)
  ON country, city, postcode FROM users;

-- ndistinct: number of distinct combinations
CREATE STATISTICS orders_pair (ndistinct)
  ON customer_id, status FROM orders;

-- MCV across multiple columns
CREATE STATISTICS posts_filter (mcv)
  ON published, category_id FROM posts;

-- Then refresh:
ANALYZE users;
ANALYZE orders;
ANALYZE posts;

After the extended statistics exist and the next ANALYZE has run, the planner uses them for correlated predicates. Selectivity estimates that were off by 100× become accurate. The plans that depended on those estimates start picking the right algorithm.

The signal that you need extended statistics: an EXPLAIN ANALYZE shows the planner estimated a small number on a multi-column predicate but the actual rows were much larger, AND single-column ANALYZE does not fix it. The correlation between the columns is the missing information.
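
You can confirm a statistics object exists, and which kinds it carries, in the pg_statistic_ext catalog. A sketch, using the users_geo object created above:

SELECT stxname, stxkind
FROM pg_statistic_ext
WHERE stxname = 'users_geo';
-- stxkind codes: f = functional dependencies, d = ndistinct, m = mcv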

Reading pg_stat_user_tables

Postgres records per-table activity counters in pg_stat_user_tables. The columns that matter for plan investigations:

SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  last_analyze,
  last_autoanalyze,
  n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_mod_since_analyze DESC
LIMIT 20;

n_mod_since_analyze is the count of rows changed (inserted, updated, deleted) since the last ANALYZE. If it is in the millions and last_analyze is hours old, the planner's view is stale. Run ANALYZE relname; and re-explain the slow query. Most of the time the plan changes for the better.

n_dead_tup in the same view is the dead-tuple count, which is the bloat signal. Lesson 5 covers what to do with that.

The principle at play

The planner is only as good as its model of your data. Most surprise slowdowns are not bad SQL or missing indexes; they are a planner working with a stale model of reality. The senior workflow is to verify that reality matches the planner's beliefs before changing anything else.

Three checks in order, on any unexplained slow query:

  1. EXPLAIN ANALYZE. Compare estimated rows vs actual rows at each node.
  2. If they diverge: run ANALYZE table_name; then re-explain.
  3. If they still diverge on a multi-column predicate: create an extended statistic, ANALYZE, re-explain.

Only then start touching SQL or indexes. The query was not always the problem.

Practice exercise

  1. Run the pg_stat_user_tables query above against your production database. Sort by n_mod_since_analyze. Identify the three tables most likely to have stale stats.
  2. For each, check last_autoanalyze. If it is hours or days old on a high-write table, autovacuum is not keeping up. The fix is either raising autovacuum aggressiveness on that table or running a scheduled ANALYZE.
  3. Pick a query in your app that filters on two correlated columns (country + city, customer + status, user + workspace). Run EXPLAIN ANALYZE. Compare estimated rows against actual. If the estimate is off by more than 5×, create an extended statistic and re-explain.
  4. Set up an alert on n_mod_since_analyze > 1_000_000 for your busiest tables. The alert is your early-warning that a plan regression is coming.
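
For step 4, a minimal probe to run from your alerting system; the threshold and schema filter are assumptions to adapt:

SELECT relname, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'
  AND n_mod_since_analyze > 1000000
ORDER BY n_mod_since_analyze DESC;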