Databases in Production · 6 of 8
Transaction isolation and locking
Read committed's lost-update problem, when SELECT FOR UPDATE is correct, advisory locks for cross-request synchronization, reading pg_locks during an incident, and the case for atomic UPDATE over Rails's with_lock.
Where this rule comes from
The SQL standard defines four transaction isolation levels: read uncommitted, read committed, repeatable read, and serializable. Each one weakens or strengthens what concurrent transactions can see and what races can happen between them. Postgres implements three: it treats read uncommitted as read committed (Postgres's MVCC makes the weaker version unnecessary), and adds the standard repeatable read and serializable on top.
The default in Postgres is read committed. Every statement in a transaction sees a fresh snapshot, including changes committed by other transactions since the previous statement. The level prevents dirty reads but leaves three concurrency anomalies in play: non-repeatable reads, phantom reads, and the classic lost-update race.
Most Rails apps run on read committed without ever consciously choosing it. The result is a class of bugs that show up under load and disappear under tests: two requests reading the same row, both computing a new value, both writing, and one write being silently overwritten. This lesson is about recognising that class and the tools for solving it.
The anti-pattern: lost update under read committed
A team builds a counter that tracks the number of views on a post:
# In a controller action
post = Post.find(params[:id])
post.views_count += 1
post.save!
Under normal load this works. Under concurrent load (two simultaneous requests for the same post), the race plays out like this:
- Request A reads views_count = 10.
- Request B reads views_count = 10.
- Request A computes 11 in Ruby and saves.
- Request B computes 11 in Ruby and saves.
- The counter is 11, not 12. One view was lost.
Read committed does not prevent this. Both reads see the same starting value. Both writes succeed because they both target the same row and the second write does not detect that the row changed between the read and the write. This is the lost-update anomaly, and it is the most common concurrency bug in Rails apps.
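The race can be replayed deterministically in plain Ruby, no threads required — the failure is in the ordering of reads and writes, not in timing:

```ruby
# Deterministic replay of the lost update: both "requests" read the counter
# before either one writes. Plain Ruby standing in for two transactions.
views_count = 10

read_a = views_count       # Request A reads 10
read_b = views_count       # Request B reads 10
views_count = read_a + 1   # Request A writes 11
views_count = read_b + 1   # Request B writes 11, silently discarding A's write

views_count  # => 11, one view lost
```

Any interleaving where both reads happen before both writes produces the same loss, which is why the bug appears only under concurrent load.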
The fix is not "use a transaction." Wrapping the read and the write in a transaction makes them atomic from the perspective of crash recovery, but under read committed both transactions can still read the same pre-update value before either one writes. The fix is to either lock the row before reading it or to let the database do the increment atomically. Three tools, each appropriate for different cases.
Atomic UPDATE: the simplest answer
For the counter case, the right answer is rarely locking. It is letting the database do the arithmetic:
Post.where(id: params[:id]).update_all("views_count = views_count + 1")
Postgres takes a row-level lock on the matching row during the UPDATE, reads the current value, adds 1, writes it back. No read-then-write-in-Ruby race. Two simultaneous UPDATEs serialise on the row lock, and the counter ends at 12. No application code is doing the addition.
Active Record's increment_counter uses this exact pattern: Post.increment_counter(:views_count, post_id) issues essentially the same UPDATE. The pattern generalises beyond counters: any update that can be expressed as a SQL expression over the current row should be done as an atomic UPDATE, not read-modify-write in Ruby.
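The same shape works with a guard: a status transition expressed as one UPDATE whose WHERE clause carries the precondition. In Rails that would be Post.where(id: id, status: "draft").update_all(status: "published"), and update_all's return value (rows changed) tells you whether you won the race. Below is an in-memory sketch, with a plain Hash standing in for the table, so the shape is visible and runnable:

```ruby
# In-memory sketch of a guarded atomic UPDATE. `rows` stands in for the posts
# table. The guard and the write happen as one step, so a concurrent publisher
# cannot interleave between check and update.
def publish!(rows, id)
  row = rows[id]
  return 0 unless row && row[:status] == "draft"  # WHERE id = ? AND status = 'draft'
  row[:status] = "published"                      # SET status = 'published'
  1                                               # rows affected
end
```

A return of 0 means the precondition failed: the post was already published (or missing), and the caller should treat that as losing the race, not as success.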
The senior reflex: when a piece of code reads a value, computes a derivation, and writes it back, ask whether the derivation can be expressed in SQL. If yes, atomic UPDATE is the right shape and locking is unnecessary.
SELECT FOR UPDATE for read-modify-write
When the modification cannot be expressed as a SQL expression (it involves business logic, branching, calls to external services), pessimistic locking is the right tool. Rails wraps it with with_lock:
post = Post.find(params[:id])
post.with_lock do
  raise "Post is locked" if post.locked_at
  raise "Quota exceeded" if post.user.over_quota?
  post.update!(approved: true, approved_at: Time.current)
end
Behind the scenes, with_lock opens a transaction and runs SELECT * FROM posts WHERE id = ? FOR UPDATE. The FOR UPDATE clause takes a row-level lock that blocks any other transaction from updating or deleting the same row until this transaction commits. The block runs with the locked row in post; the transaction commits at the end, releasing the lock.
The cost is throughput. While the lock is held, every other request for this row waits. Holding a row lock across a slow external API call (a Stripe charge, a Mailchimp sync) means every concurrent request for the same row sits behind it. The rule: do not call out to external services inside a with_lock block. Take the lock, do the in-memory work, commit, then call out.
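That rule can be made mechanical: capture whatever the external call needs inside the lock, then make the call after the block ends. A sketch of the shape — the notifier is injected here so the pattern is shown in isolation; in a real app it would be the Stripe or Mailchimp client:

```ruby
# Sketch of "take the lock, commit, then call out". The with_lock block holds
# the row lock only for the in-memory work; the slow external call happens
# after the transaction commits and the lock is released.
def approve_and_notify(post, notifier)
  payload = nil
  post.with_lock do
    post.update!(approved: true)
    payload = { post_id: post.id }  # capture what the call-out needs
  end
  notifier.call(payload)            # lock already released here
end
```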
Two variants worth knowing. FOR UPDATE SKIP LOCKED skips rows that are already locked by another transaction instead of waiting. Useful for job queues: each worker picks the next unlocked job. FOR NO KEY UPDATE is a weaker lock that allows other transactions to take foreign-key locks on the row, which avoids some lock contention in busy parent-child relationships.
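The job-queue shape can be sketched in memory, with a set standing in for rows currently row-locked by other transactions; the real Rails query would be something like Job.order(:id).lock("FOR UPDATE SKIP LOCKED").first:

```ruby
# In-memory model of FOR UPDATE SKIP LOCKED: a worker claims the first job
# whose row is not already locked elsewhere, instead of waiting on it.
# `locked` stands in for the set of row locks held by other transactions.
def claim_next_job(jobs, locked)
  job = jobs.find { |j| !locked.include?(j) }  # SKIP LOCKED: pass over held rows
  locked << job if job                         # FOR UPDATE: take the row lock
  job
end
```

Two workers calling this concurrently claim different jobs; with plain FOR UPDATE, the second worker would block on the first worker's row instead.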
Advisory locks for cross-request synchronization
Row locks protect rows. Some synchronizations are not about a single row: "only one job processes this user at a time," "only one cron run happens at once across all servers," "only one user can edit this document concurrently." For these, Postgres offers advisory locks: named application-level locks the database tracks but does not enforce against any particular table.
# Take a session-scoped advisory lock keyed by user_id.
# Cast to Integer first: the key is interpolated into SQL, so never pass raw input.
key = Integer(user_id)
ActiveRecord::Base.connection.execute("SELECT pg_advisory_lock(#{key})")
begin
  # ... do work that should be exclusive per user
ensure
  # Release on the same connection that acquired the lock
  ActiveRecord::Base.connection.execute("SELECT pg_advisory_unlock(#{key})")
end
The lock is held until the matching release call or until the session ends. Other sessions calling pg_advisory_lock with the same key wait until the lock is released. pg_try_advisory_lock is the non-blocking variant: returns true if the lock was acquired, false if not. Useful for "if someone else is already doing this, skip."
The with_advisory_lock gem wraps this in a Rails-friendly block. Production uses include: making sure a billing job runs once per customer per day even across multiple worker processes, ensuring a slow migration script does not run twice on overlapping deploys, serialising a global recompute that must not double-run.
Two variants matter. pg_advisory_xact_lock takes a transaction-scoped lock that auto-releases on commit or rollback. This is usually what you want for short critical sections: less risk of leaking a lock if Ruby raises. pg_advisory_lock is session-scoped, useful for longer-running coordination but requires careful release.
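The "skip if already running" pattern looks like this, sketched with the lock calls injected as callables so it runs standalone. In a Rails app, try_lock would run SELECT pg_try_advisory_lock($1) and unlock would run SELECT pg_advisory_unlock($1), both on the same connection:

```ruby
# Non-blocking advisory lock pattern: if another session holds the lock, skip
# instead of waiting. try_lock/unlock are hypothetical stand-ins for
# pg_try_advisory_lock / pg_advisory_unlock on a single Postgres session.
def run_exclusively(key, try_lock:, unlock:)
  return :skipped unless try_lock.call(key)  # false means another session holds it
  begin
    yield
    :ran
  ensure
    unlock.call(key)  # release even if the block raises
  end
end
```

The return value lets the caller distinguish "did the work" from "someone else was already doing it", which matters for metrics and alerting on skipped runs.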
Reading pg_locks during an incident
When the app gets paged for "database connections at limit, queries hanging," the first place to look is pg_locks joined with pg_stat_activity. The query that finds blocking sessions:
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
blocking.state AS blocking_state,
now() - blocking.xact_start AS blocking_age
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
Each row is a "this query is waiting on that query." The blocking query's age tells you how long it has held the lock. The pattern to watch for: one slow query holding a row lock, dozens of fast queries piling up behind it. Cancelling the blocking query (SELECT pg_cancel_backend(pid);), or terminating the whole session if cancellation is not enough (SELECT pg_terminate_backend(pid);), often clears the cascade in seconds.
The other thing to spot is deadlocks: two transactions each holding a lock the other needs. Postgres detects them automatically and aborts one of them with ActiveRecord::Deadlocked (or PG::TRDeadlockDetected). A deadlock log entry means the application took locks in inconsistent order between transactions. Most fixes are about ordering: always lock parent before child, always lock by id ascending, never take two locks in a loop without thinking about the order.
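The "always lock by id ascending" fix can be sketched directly. The yielded step is where a real app would run Account.lock.find(id) (a SELECT ... FOR UPDATE); the part that prevents deadlock is simply that every transaction agrees on the order:

```ruby
# Deadlock-safe lock ordering: sort ids before locking, so two transactions
# touching the same pair of rows always acquire their locks in the same order.
# The block is where Account.lock.find(id) would run in a Rails app.
def lock_in_order(*ids)
  ids.sort.each { |id| yield id }
end
```

A transfer(5, 3) and a transfer(3, 5) both lock row 3 before row 5, so neither can hold one lock while waiting for the other in reverse — the cycle that a deadlock requires never forms.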
When to graduate to repeatable read or serializable
Read committed plus row-level locking handles most concurrency cases. The two stronger levels exist for the cases where they do not.
Repeatable read guarantees that every query inside a transaction sees the same snapshot of the data. Useful for reports that span multiple queries and need a consistent view. The cost: if your transaction tries to update a row that a concurrent transaction has modified since your snapshot, it fails with "could not serialize access due to concurrent update", which means application-level retry on the error.
Serializable guarantees that the result of all concurrent transactions is equivalent to running them one at a time in some order. The strongest level; the most expensive. Postgres detects conflicts and aborts losing transactions. The pattern: wrap critical sections in a serializable transaction, catch ActiveRecord::SerializationFailure, retry with backoff.
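A sketch of that retry loop, with a stand-in exception class so it runs standalone; in Rails the rescue would target ActiveRecord::SerializationFailure and the yield would happen inside ActiveRecord::Base.transaction(isolation: :serializable):

```ruby
# Stand-in for ActiveRecord::SerializationFailure so the sketch is self-contained.
SerializationFailure = Class.new(StandardError)

# Retry with exponential backoff when a serializable transaction loses a conflict.
def with_serializable_retry(max_attempts: 3, base_delay: 0.05)
  attempts = 0
  begin
    attempts += 1
    # Real app: ActiveRecord::Base.transaction(isolation: :serializable) { yield }
    yield
  rescue SerializationFailure
    raise if attempts >= max_attempts          # give up after the retry budget
    sleep(base_delay * (2 ** (attempts - 1)))  # 50ms, 100ms, 200ms, ...
    retry
  end
end
```

The cap matters: under heavy contention, unbounded retries turn a conflict hot spot into a retry storm. Surfacing the exception after a few attempts is the correct failure mode.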
The decision rule: stay on read committed unless a specific anomaly is biting you and atomic UPDATE plus SELECT FOR UPDATE cannot fix it. Stronger isolation has real cost in retries, and the retry loops have to be correct. Most teams never need repeatable read or serializable; the teams that do almost always reach for it specifically rather than as a default.
The principle at play
Three tools, three problem shapes:
- Atomic UPDATE when the new value is a SQL expression of the old. Counters, increments, decrements, status transitions.
- SELECT FOR UPDATE (with_lock) when the new value depends on application logic. Approval flows, complex state machines, multi-field updates that read each other.
- Advisory locks when the synchronization is not about a row. Cross-request coordination, per-tenant job exclusivity, "only one of these can run at a time" across servers.
The senior reflex is recognising which shape the problem has before reaching for a tool. Most "we need a lock" instincts dissolve into "we need an atomic UPDATE." Most "the database is hung" investigations resolve into "one session is holding a lock too long." Knowing the three tools, and which one fits which shape, is most of the concurrency story in a Postgres-backed Rails app.
Practice exercise
- Grep your codebase for += 1, -= 1, or any read-then-update on a numeric column. Each one is a candidate lost-update bug. Rewrite as atomic UPDATE.
- Find with_lock blocks in the codebase. For each, check whether the block calls out to an external service. If yes, refactor: take the lock, do in-memory work, commit, then call out.
- Identify a piece of work that should run "only one at a time" across servers (a daily report, a backfill, a long-running calculation). Wrap it in pg_try_advisory_lock. Confirm that running two copies in parallel results in one acquiring the lock and the other no-op-ing.
- Save the pg_locks query above somewhere accessible. Next time the database is slow, run it first. The blocking-session pattern is the single most common operational cause of an "everything is hung" page.