Databases in Production · 5 of 8
MVCC, vacuum, and bloat
What UPDATE actually does, why a 10GB table can hold 200MB of live data, when autovacuum is enough, when to intervene, and the VACUUM FULL vs pg_repack call that catches teams out.
Where this rule comes from
Postgres uses MVCC (Multi-Version Concurrency Control) to handle concurrent reads and writes. The core idea: instead of writers and readers locking each other out, every transaction sees a consistent snapshot of the data (taken per statement in READ COMMITTED, per transaction in REPEATABLE READ and above). Writers do not block readers, readers do not block writers.
The mechanism is per-row versions. Every row carries an xmin (the transaction that inserted it) and an xmax (the transaction that deleted or updated it, or zero if still live). When you read a row, Postgres checks these against the snapshot's "what transactions count as visible" set. Old transactions see old versions; new transactions see new ones.
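xmin and xmax are real system columns on every table, hidden from SELECT * but selectable by name. A quick look, using an illustrative orders table:

-- System columns must be named explicitly; they never appear in SELECT *
SELECT xmin, xmax, id
FROM orders
LIMIT 5;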
The consequence: UPDATE does not overwrite anything. It writes a new row and marks the old one dead (sets xmax). DELETE does not free space either; it sets xmax and moves on. Until no running transaction can still see the old version, the dead tuple sits in the table taking up disk space.
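A minimal way to watch this happen, assuming an orders table with a status column (both illustrative):

-- Rewrites every row without changing anything: N live tuples become N live + N dead
UPDATE orders SET status = status;

-- The dead-tuple counter jumps by the row count (stats update with a short lag)
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';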
The cleanup process that eventually reclaims the space is vacuum. Autovacuum runs it in the background based on a threshold of changed rows. The rest of this lesson is about what happens when autovacuum keeps up, what happens when it does not, and how to fix the damage.
The anti-pattern: "why is my SELECT slow on a table I freshly updated?"
A team runs a bulk update at 2 a.m. to populate a new column on a 5-million-row orders table. The migration finishes in 12 minutes. The next morning, queries against orders are slower than yesterday. Nothing about the queries or the indexes changed.
What happened: the bulk update wrote 5 million new row versions and marked 5 million old ones dead. The table is now twice its previous size on disk. Every seq scan reads twice as many pages. Index scans pay too: the index still carries entries pointing at the dead versions, and each one costs a heap visit just to discover the tuple is no longer visible.
The fix is vacuum, but autovacuum may not get to it for hours. It is throttled by default and runs with a small worker pool (three workers by default, each processing one table at a time). Until it catches up, the table is bloated and queries are slow.
The senior reflex on a freshly-updated table is to run VACUUM ANALYZE orders; manually, immediately after the bulk update. That marks the dead-tuple space reusable and refreshes the stats in one pass. Query speed comes back within minutes.
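The VERBOSE option makes the cleanup visible in the output; the post-migration step is one line:

-- Run right after the bulk update commits; VERBOSE reports what was removed
VACUUM (ANALYZE, VERBOSE) orders;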
How autovacuum works
Autovacuum is a worker pool that wakes up at intervals (default: every minute) and checks each table for two conditions:
- Dead tuples: when n_dead_tup exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × n_live_tup (default: 50 + 0.2 × live rows). On a million-row table, this means roughly 200,000 dead tuples before vacuum kicks in. A query to check where a table stands is sketched after this list.
- Stale statistics: a similar threshold for changed rows, triggering ANALYZE instead of VACUUM.
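To see how close each table is to its trigger under the default settings (per-table overrides change the math):

-- Distance from the default autovacuum trigger: 50 + 0.2 × live rows
SELECT relname,
       n_dead_tup,
       round(50 + 0.2 * n_live_tup) AS default_trigger
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;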
When a worker starts, it scans the table, marks reusable space in the free space map, updates the visibility map, removes index entries pointing at dead tuples, and updates stats. It does not reduce the table's size on disk. Reclaimed space becomes available for new inserts within the existing pages; it does not return to the OS. Disk usage stays the same until the table grows into it.
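The "reusable, not returned" point is easy to verify on any bloated table:

-- Plain VACUUM: space becomes reusable internally, the file does not shrink
SELECT pg_size_pretty(pg_table_size('orders'));  -- before
VACUUM orders;
SELECT pg_size_pretty(pg_table_size('orders'));  -- after: roughly unchanged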
Autovacuum is throttled by default: each worker accumulates an IO cost as it reads and dirties pages, and sleeps whenever the running total hits autovacuum_vacuum_cost_limit. This keeps it from disrupting live traffic. The downside is that on a high-write table, autovacuum can fall behind: n_dead_tup grows faster than the throttled worker can clean it.
Recognising bloat in production
Three signals show up in the catalogs.
1. n_dead_tup vs n_live_tup in pg_stat_user_tables. A ratio over 0.2 means autovacuum has fallen behind. A ratio over 1.0 means there is more dead than live data. Query:
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup,
round(n_dead_tup::numeric / nullif(n_live_tup, 0), 2) AS dead_ratio,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
ORDER BY dead_ratio DESC NULLS LAST
LIMIT 20;

2. Table size vs row count. Compare pg_total_relation_size(relname) to n_live_tup × average row width. If the disk size is much larger than the expected size, the table is bloated.
3. pgstattuple for the exact picture. An extension that reads every page and reports live/dead/free space precisely. Expensive on a huge table, but the most accurate measure. Use sparingly during investigations.
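A minimal pgstattuple check (the extension ships with Postgres contrib):

-- Exact live/dead/free breakdown; scans every page of the table
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('orders');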
When to intervene
Most bloat resolves itself: autovacuum catches up overnight, queries speed back up by morning. Intervene when:
- The dead-tuple ratio is over 1.0 on a table that matters to live traffic, and autovacuum is visibly throttled (last_autovacuum hours ago, n_dead_tup still rising).
- A bulk UPDATE or DELETE has finished. Run VACUUM ANALYZE table_name; manually to skip the autovacuum lag.
- Disk space is constrained and table size needs to actually shrink (not only become reusable). Manual VACUUM does not reclaim disk; VACUUM FULL or pg_repack does.
Tuning autovacuum aggressiveness per table is the long-term fix on high-write tables:
-- More aggressive autovacuum on a specific hot table
ALTER TABLE events SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_cost_limit = 2000
);

The first knob lowers the trigger threshold (vacuum when 5% of rows are dead, not 20%). The second raises the IO cap so the vacuum finishes faster. Both are per-table so they do not punish autovacuum on tables that are fine.
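To confirm the overrides took, per-table settings are visible in pg_class.reloptions:

-- Per-table storage parameters live in reloptions
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'events';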
VACUUM FULL vs pg_repack
Regular VACUUM does not return disk to the OS. If a table actually needs to shrink (after a one-time big delete, or after years of bloat), the two options are VACUUM FULL and pg_repack.
VACUUM FULL rewrites the table from scratch, removing dead tuples and packing the survivors into fewer pages. It takes an AccessExclusiveLock on the table for the entire duration. Running VACUUM FULL on a 100GB production table will lock it for an hour or more. Every query that touches the table waits. Production effectively goes down.
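For completeness, the command itself is one line; the danger is the lock, not the syntax:

-- Takes AccessExclusiveLock for the entire rewrite; maintenance windows only
VACUUM FULL orders;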
The "we ran VACUUM FULL on the orders table during the maintenance window and it ran into business hours" story is one of the most consistent Postgres operational pains. The rule: VACUUM FULL is a maintenance-window-only tool, on tables you can afford to lock.
pg_repack is the production-safe alternative. It is a Postgres extension that rebuilds the table online: creates a shadow table, copies live tuples while tracking ongoing changes via a trigger, swaps the new table in under a brief lock at the end. Total lock time is seconds, not hours. Disk usage temporarily doubles during the rebuild, which is the main constraint.
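pg_repack runs from the shell, not from psql, and the extension must be installed in the target database first. A sketch with illustrative table and database names (flags vary by version; check pg_repack --help):

# Rebuild one table online; the exclusive lock is held only for the final swap
pg_repack --table=orders production_db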
For a Rails app's regular bloat maintenance: pg_repack. For a one-time shrink during a planned downtime window: either, with VACUUM FULL being slightly faster and not requiring the extension. The senior call is never running VACUUM FULL unintentionally during business hours.
The principle at play
Postgres trades disk space for concurrency. Every UPDATE writes a new row version; every DELETE leaves a tombstone. The trade pays off because reads and writes do not block each other and snapshots are consistent without coarse locking. The cost is ongoing cleanup work and the bloat that accumulates when cleanup falls behind.
The senior view is treating autovacuum as a workload to provision for, not a background detail that handles itself. High-write tables need tuning. Bulk migrations need manual VACUUM ANALYZE at the end. Disk-shrinking needs pg_repack, not VACUUM FULL. Recognising the bloat signal in pg_stat_user_tables before it shows up as user-visible slowness is the difference between a calm Monday and a 9 a.m. incident.
Practice exercise
- Run the pg_stat_user_tables bloat query above against production. Identify the top 5 bloated tables (highest dead-tuple ratio).
- For each, check last_autovacuum. If it is more than a few hours old on a hot table, tune autovacuum aggressiveness with the ALTER TABLE ... SET pattern above. Re-check the next day.
- The next time you ship a bulk UPDATE migration, add a VACUUM ANALYZE table_name; at the end of the migration. Confirm production read speed does not regress afterwards.
- Install pg_repack on a staging database. Repack a bloated table. Measure the size before and after. If you ever reach for VACUUM FULL in production, this is your alternative.