Scaling Series · 4 of 8
Hot Rows & Contention
Where race conditions on counter caches come from, why the increment! you ship in year one is the deadlock in year three, and how senior Rails teams handle concurrent writes to the same row.
Where this rule comes from
Relational databases were designed in the 1970s around one of the strongest correctness guarantees in computing: ACID transactions. The "I" stands for Isolation: concurrent transactions cannot see each other's intermediate state. Postgres, MySQL, and every other major database invest enormous engineering effort to make Isolation work efficiently.
Isolation is not free. When two transactions try to update the same row at the same time, the database has to make one of them wait. For most rows this is invisible: the lock is held for microseconds and the second transaction pauses only a moment. But certain rows in every app are hot: dozens, sometimes hundreds, of transactions per second all want to write to them. Counter caches. Inventory rows. Vote totals. The "current week's leaderboard" record. When a row gets hot, the small per-write lock wait stacks up into seconds of latency, then deadlocks, then production incidents.
Ruby developers learn early that user.posts_count += 1 is wrong because of the lost-update problem. The senior version of this knowledge goes further: even when you use increment! correctly, certain rows become bottlenecks at scale, and the fix is structural, not "be more careful".
This lesson covers the three concrete shapes of hot-row contention in Rails, how to spot them in production, and the patterns senior codebases use to avoid them. The thread that ties them together: concurrent writes to the same row are physics, not philosophy. You cannot wish them away with better code; you have to design around them.
The anti-pattern (three flavors)
Flavor one: the read-modify-write race. Every Rails developer encounters this once.
# In a controller or a job:
user = User.find(user_id)
user.posts_count += 1 # read happens here
user.save! # write happens here
# Between read and write, another request can run the same code.
# Both end up writing the same +1, and one is lost.
# The counter drifts. It never recovers. The fix is well-known: use atomic SQL.
# Atomic at the database level:
User.where(id: user_id).update_all("posts_count = posts_count + 1")
# Or the Rails wrapper that does the same:
user.increment!(:posts_count)
This is the well-known fix, and the one most Rails developers reach for. The hard problems start one level up.
Flavor two: contention from many writers on one row.
# A heavily-trafficked feature: every post view increments
# the post's view count.
class PostsController < ApplicationController
def show
@post = Post.find(params[:id])
@post.increment!(:view_count)
# ... render
end
end
Atomic SQL fixes the lost update. It does not fix the lock contention. Every view of the same post takes a row-level lock on the post for the duration of the UPDATE. At 100 views per second on the same post, the UPDATEs serialize: each one waits for the previous one to commit. The database starts logging lock waits. pg_stat_activity fills with rows waiting on wait_event = "transactionid". Latency on every endpoint touching that post climbs.
Flavor three: deadlock from update order.
# Transferring money between accounts:
class Transfer
def call(from_id:, to_id:, amount:)
ApplicationRecord.transaction do
from = Account.find(from_id)
to = Account.find(to_id)
from.update!(balance: from.balance - amount)
to.update!(balance: to.balance + amount)
end
end
end
# Two simultaneous transfers in opposite directions:
# Transfer A -> B locks A, then waits to lock B
# Transfer B -> A locks B, then waits to lock A
# Postgres detects the deadlock, aborts one transaction.
# Application code sees PG::TRDeadlockDetected (wrapped by Rails
# as ActiveRecord::Deadlocked), raised at random.
Each of these is a different shape of the same underlying problem: shared mutable state in the database, written from multiple concurrent transactions. The fixes differ, but they all live in the structural-design space, not in "write more careful Ruby."
The fixes
Fix 1: optimistic locking for read-then-write logic that is not pure increments.
# Add a lock_version column to the table:
class AddLockVersionToOrders < ActiveRecord::Migration[8.0]
def change
add_column :orders, :lock_version, :integer, default: 0, null: false
end
end
# Rails automatically uses it to detect concurrent updates:
order = Order.find(id)
order.status = "shipped"
order.save!
# If another transaction updated this row between find and save,
# ActiveRecord raises StaleObjectError.
# The caller catches it and retries the operation from scratch.
Optimistic locking is the right tool when writes are rare enough that retries are cheap. It is the wrong tool for hot rows where every request would retry; you would merely be shifting the contention into application code.
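What the retry looks like in practice, as a minimal sketch; the ship_order helper and the bound of three attempts are illustrative, not part of Rails:
def ship_order(id, attempts: 3)
  order = Order.find(id)   # fresh read picks up the current lock_version
  order.status = "shipped"
  order.save!
rescue ActiveRecord::StaleObjectError
  attempts -= 1
  retry if attempts.positive?
  raise
end
The key detail is re-reading the record inside the retry path; retrying save! on the stale in-memory object would fail forever.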
Fix 2: pessimistic locking with consistent ordering for true transfers.
class Transfer
  def call(from_id:, to_id:, amount:)
    ApplicationRecord.transaction do
      # Lock the rows one at a time, in ascending ID order, so every
      # concurrent transfer requests its locks in the same sequence.
      # (A single WHERE id IN (...) FOR UPDATE does not guarantee
      # which row gets locked first.)
      accounts = [from_id, to_id].sort.map { |id| Account.lock.find(id) }
      from = accounts.find { |a| a.id == from_id }
      to   = accounts.find { |a| a.id == to_id }
      from.update!(balance: from.balance - amount)
      to.update!(balance: to.balance + amount)
    end
  end
end
The trick is the .sort on the IDs. By always locking the lower ID first, all concurrent transfers, regardless of direction, request locks in the same order. Deadlocks become impossible because the cycle in the lock graph cannot form.
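For reference, each Account.lock.find call in the sketch above takes its row lock via SELECT ... FOR UPDATE; roughly:
-- Roughly what Account.lock.find(42) issues on Postgres:
SELECT "accounts".* FROM "accounts"
WHERE "accounts"."id" = 42 LIMIT 1 FOR UPDATE;
The FOR UPDATE clause is what blocks other writers (and other FOR UPDATE readers) until this transaction commits or rolls back.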
Fix 3: aggregation outside the hot row for high-traffic counters.
# Instead of incrementing the post's view_count on every request,
# write to a side table sharded by time:
class PostViewsController < ApplicationController
def create
PostView.create!(post_id: params[:post_id], viewed_at: Time.current)
head :ok
end
end
# A scheduled job aggregates the side table into the post's
# view_count every few minutes:
class AggregatePostViewsJob < ApplicationJob
  def perform
    # Snapshot the cutoff once so the count and the delete see the
    # same set of rows.
    cutoff = 5.minutes.ago
    PostView.where("viewed_at < ?", cutoff).group(:post_id).count.each do |post_id, count|
      Post.where(id: post_id).update_all(["view_count = view_count + ?", count])
    end
    PostView.where("viewed_at < ?", cutoff).delete_all
  end
end
The hot row becomes a hot table. Inserts to a table do not contend on the same row; they get separate rows. The aggregation step still does the contended update, but it does it once per post per interval instead of once per request. Throughput on the hot row goes from "100 increments per second, serialized" to "one update every five minutes, effectively free."
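The side table itself is deliberately boring. A sketch of the migration, assuming the schema used above; the index on viewed_at keeps the aggregation scan and the delete cheap:
class CreatePostViews < ActiveRecord::Migration[8.0]
  def change
    create_table :post_views do |t|
      t.references :post, null: false
      t.datetime :viewed_at, null: false
    end
    # The aggregation job filters and deletes by viewed_at.
    add_index :post_views, :viewed_at
  end
end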
Fix 4: counter_cache when you can use it.
Rails' built-in counter_cache uses atomic SQL underneath. For the typical many-to-one relation (a Comment belongs to a Post, the Post has a comments_count), it is the right tool:
class Comment < ApplicationRecord
belongs_to :post, counter_cache: true
end
class Post < ApplicationRecord
has_many :comments
end
# On Comment create/destroy, Rails runs:
# UPDATE posts SET comments_count = comments_count + 1 WHERE id = ?
# Atomic. Safe under concurrency.
counter_cache is the senior-friendly default as long as the row is not so hot that even the atomic UPDATE becomes contended. When it does, typically at hundreds of writes per second on the same parent, switch to the aggregation pattern in fix 3.
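One companion worth knowing: if a counter drifted before counter_cache was in place, Rails can rebuild it from the association with a single call:
# Recounts the comments association and writes the correct value:
Post.reset_counters(post.id, :comments)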
How to spot hot rows in production
Three signals:
1. Lock waits in Postgres. The query that reveals the active wait graph:
-- Who is waiting on whom right now?
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
-- The queries that show up here repeatedly are your hot rows.
-- (A richer version of this query follows after this list.)
2. Deadlock errors in your error tracker. Sentry, Honeybadger, or Bugsnag will pick up ActiveRecord::Deadlocked and PG::TRDeadlockDetected exceptions. Any of these in production is a structural problem that needs a structural fix, not a retry loop.
3. Latency spikes correlated with a specific record. If your APM shows that a particular endpoint's p99 latency spikes whenever traffic spikes, and the endpoint touches the same row each time (post view, leaderboard, popular product), that row is hot. The fix is to move the write off the hot row.
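The richer version of the signal-1 query, promised above: Postgres 9.6+ exposes pg_blocking_pids(), which names the blockers directly.
-- For each waiting backend, list the pids currently blocking it:
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
If the same UPDATE on the same table keeps appearing as the blocked query, that table holds your hot row.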
What real teams have written
Shopify's IdentityCache gem (open-sourced) is partly a response to hot-row contention. By caching frequently-read parent records in memcached, IdentityCache reduces the number of SELECTs that hit the hot row in the first place, which reduces lock contention indirectly. The lesson worth absorbing is not "use IdentityCache"; it is that Shopify hit this problem at scale and the architectural answer was to read around the database, not to write into it differently.
Artsy Engineering's public posts on counter caches walk through the same evolution: started with naive user.update(count: count + 1), hit data drift, migrated to atomic SQL, hit lock contention on hot artworks, moved aggregation to a separate side table. The blog post traces a single counter through three rewrites over two years, which is approximately the real-world timeline for any team hitting this class of problem.
GitHub's engineering writing on hot-row contention focuses on hot rows in their notification system: a small number of repositories whose notification rows received vastly more writes than the median. Their architectural answer involved partitioning: routing writes for the hottest repositories to dedicated tables that could be tuned for their write pattern. Most Rails apps will never need this; the lesson is that even GitHub's solution started with measuring exactly which rows were hot, then designing around them, not preemptively partitioning everything.
When to ignore this lesson
For apps under 100 writes/sec on the busiest row, hot-row contention is almost certainly not your problem. Use counter_cache, use Rails' default mechanisms, and do not over-engineer. The patterns in this lesson are for the day you can actually point at a row in pg_stat_activity and see queue depth.
For most apps, the order of concern is: missing indexes, then N+1 queries, then synchronous I/O, then everything else. Hot rows usually show up after the first three are clean, when the app is fast enough that lock contention becomes visible.
One operational rule that catches many hot-row bugs early: never hold a database transaction open across a network call. If your transaction includes SomeMailer.deliver_now or Stripe::Charge.create, you are holding a row lock for the duration of an external HTTP call. That single mistake explains a large fraction of production lock-wait incidents in Rails apps. Move external calls outside the transaction (or to after_commit) and many hot-row issues disappear.
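A minimal sketch of the rule; Order, OrderMailer, and the status column are hypothetical, the shape is what matters:
class Order < ApplicationRecord
  # The receipt goes out only after the transaction commits, so no
  # row lock is held during the network call.
  after_commit :send_receipt, if: -> { saved_change_to_status?(to: "paid") }

  def pay!
    transaction do
      update!(status: "paid")   # row lock held only for this UPDATE
    end
  end

  private

  def send_receipt
    OrderMailer.receipt(self).deliver_later   # background job, no lock held
  end
end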
The principle at play
Concurrency on shared state is one of the oldest problems in computing, and the database has spent fifty years working on it. The patterns in this lesson are not Rails-specific; they are translations of database-systems theory into Rails idiom. Optimistic locking is the version-clock pattern. Lock ordering is deadlock-avoidance. Aggregation tables are sharding-in-miniature. None of them are new ideas; they are old ideas that show up exactly when an app reaches the scale where the textbook problems become real.
The deeper move is that row contention is not a code bug; it is a design constraint imposed by the database's correctness guarantees. The database is doing its job, keeping the data consistent under concurrent writes. The application's job is to design data structures so that hot rows do not happen, and when they do happen, to provide aggregation points that absorb the load.
The pragmatic value: most apps that have a hot-row problem hit it in production with no warning, because the contention does not appear until traffic crosses a threshold. The senior who has internalized this lesson watches for the early signs (pg_stat_activity waits, occasional deadlock errors) and ships a structural fix before the second incident. The intermediate who has not internalized it adds retries until the deadlocks become "expected behavior."
Practice exercise
- Run grep -rn "+= 1\|-= 1" app/ | grep -i "\.update\|\.save" in your codebase. Every match is a potential lost-update race. Replace each with increment!/decrement! or an atomic update_all.
- Run the pg_stat_activity wait query against your production database during peak traffic. If any queries appear with wait_event = transactionid repeatedly, you have a hot row. Note the table.
- Audit every transaction in your codebase that includes external HTTP calls or mailer sends. Move the external work to after_commit or a background job. This single change is the most impactful thing you can do for lock-wait reduction.
- If you use Rails' counter_cache on a heavily-trafficked relationship, check whether the parent record's update rate exceeds ~50 per second. If yes, plan the migration to an aggregation table.
- Bonus: search your Sentry/Honeybadger for ActiveRecord::Deadlocked or StaleObjectError. Each one is a real structural issue, not a transient bug.