Databases in Production · 4 of 8

Window functions and CTEs without the foot-gun

Running totals, top-N per group, gap detection, the cases where 50 lines of Ruby compress to 4 lines of SQL, and the CTE materialization change in Postgres 12 that decides whether your query plan is great or terrible.

Where this rule comes from

Two SQL features cover most of the cases where Rails developers reach for Ruby loops over query results. Window functions compute values across a set of rows related to the current row, without collapsing the rows into a single aggregate. Common Table Expressions (CTEs) let a query reference a named sub-result, including recursive references for hierarchical data.

Both have been in Postgres since version 8.4 (2009). Neither is exotic. Both show up rarely in Rails codebases because Active Record's vocabulary points at scopes and joins, and the patterns these features enable require dropping into SQL. The shift seniors make is recognising when 50 lines of Ruby that fetch, iterate, and group rows are doing work the database could do in 4 lines.

The other thing that matters is what changed in Postgres 12. CTEs used to be an optimization fence: the planner could not push predicates through them. PG12 changed that, and the rule of when to reach for a CTE shifted with it.

The anti-pattern: ranking in Ruby

A team builds a leaderboard endpoint: "the top 3 posts per user, ordered by score." The Ruby-first approach:

# Fetches every post in the system, then ranks in Ruby
top_three_per_user = Post.order(score: :desc).group_by(&:user_id).transform_values { |posts| posts.first(3) }

On a posts table with 5 million rows, this query reads all 5 million rows from the database, ships them to the Rails process, allocates 5 million Active Record objects, and throws 99% of them away. The wall-clock time is dominated by data transfer and Ruby memory allocation, not by the database itself.

The same problem solved with a window function:

Post.from(<<~SQL).where("rn <= 3")
  (
    SELECT
      posts.*,
      ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY score DESC) AS rn
    FROM posts
  ) ranked
SQL

The database ranks rows in-place, returns only the top 3 per user (a few thousand rows), and finishes in milliseconds. The right place to do per-group ranking is in the engine that already has the rows sorted in pages on disk, not in Ruby after a round-trip.

Window functions for "top N per group"

A window function is a function called over a window of rows. The window is defined by OVER (PARTITION BY ... ORDER BY ...). PARTITION BY divides the rows into groups; ORDER BY sorts within each group. ROW_NUMBER() assigns a sequential integer per row in each partition.

The full list of ranking functions:

ROW_NUMBER(): 1, 2, 3, 4, ... (ties get arbitrary order)
RANK(): 1, 2, 2, 4, ... (ties share a rank, next rank jumps)
DENSE_RANK(): 1, 2, 2, 3, ... (ties share a rank, no gap)
NTILE(n): divides rows into n approximately equal buckets, returns the bucket number

Most leaderboard and top-N-per-group queries use ROW_NUMBER. The senior reflex: when the question is "give me the top N rows per category," reach for a window function before reaching for a loop.

Running totals and gap detection

Beyond ranking, window functions cover two other patterns that show up in product code: cumulative aggregates and adjacent-row comparisons.

Running totals: a balance column computed from a stream of transactions. The Ruby-first approach iterates, accumulates, writes back. The window function:

SELECT
  id,
  amount_cents,
  SUM(amount_cents) OVER (PARTITION BY account_id ORDER BY created_at) AS running_balance
FROM transactions
ORDER BY account_id, created_at;

Postgres walks the partition in order, accumulating the sum. The result is the running balance at every row. No application-level iteration, no memory accumulation in Ruby.

Gap detection: LAG() and LEAD() let a row see the previous or next row's value. Finding logins more than 24 hours apart per user:

SELECT
  user_id,
  created_at,
  created_at - LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS gap
FROM sessions
WHERE created_at - LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) > INTERVAL '24 hours';

Postgres compares each row to its predecessor in the same partition. This kind of query in Ruby would require iterating in chronological order, holding the previous row in memory, and computing the diff. In SQL it is one statement.

CTEs and the materialization fence

A CTE is a named subquery. The syntax:

WITH recent_orders AS (
  SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '30 days'
)
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(amount_cents) AS total_cents
FROM recent_orders
GROUP BY customer_id;

Before Postgres 12, every CTE was materialized: Postgres computed the CTE's full result into a temporary set, then ran the outer query against it. Predicates from the outer query could not push down into the CTE, even when it would have helped. This is the "optimization fence" old Postgres docs reference.

The fence had a use: forcing materialization when you wanted the planner to compute the CTE once and reuse the result. But the cost was that naive CTEs gave bad plans on large tables. Many Postgres teams in the 9.x era avoided CTEs entirely for that reason.

Postgres 12 inverted the default. CTEs are inlined unless declared WITH ... AS MATERIALIZED or referenced multiple times. The planner can now push predicates through, choose join orders that span the CTE boundary, and pick plans as if the CTE were a subquery. CTEs in PG12+ are safe to use for readability without paying a plan tax.

Forcing materialization is still useful in two cases. One: the CTE result is expensive and used multiple times in the outer query, so caching it explicitly pays back. Two: the CTE contains side effects (an INSERT, UPDATE, or DELETE returning rows), in which case materialization is required for correctness.

Recursive CTEs for hierarchical data

Trees and graphs in a relational database are usually modelled with a self-referential foreign key: parent_id on the same table. Walking the tree in Ruby (one query per level) is N+1 from hell. The right tool is a recursive CTE:

WITH RECURSIVE descendants AS (
  SELECT id, parent_id, name, 0 AS depth
  FROM categories
  WHERE id = 42

  UNION ALL

  SELECT c.id, c.parent_id, c.name, d.depth + 1
  FROM categories c
  JOIN descendants d ON c.parent_id = d.id
)
SELECT * FROM descendants;

Postgres starts with the seed query (the row with id 42), then repeatedly joins the recursive query against the accumulated result until it returns no new rows. The output is the full subtree rooted at id 42, with depth annotation.

Comment threads, organisational charts, category hierarchies, dependency graphs: all of these are recursive CTE queries. The alternative in Active Record is fetching one level at a time or using a heavyweight extension like closure_tree. For occasional traversals, the CTE is shorter and faster.

The principle at play

The database has already sorted, grouped, and indexed your data. Shipping rows to the application to do work the database could do is paying twice: once for the data transfer, once for the Ruby-side computation. The senior move is recognising the patterns that compress in SQL and using them.

Three signals that a Ruby loop should be a SQL query:

The loop fetches more rows than it returns. (Filtering, top-N, deduplication.)
The loop computes something per group. (Running totals, per-group ranks, percentiles.)
The loop compares each row to its neighbours. (Gaps, deltas, sequences.)

Each of these has a one-statement SQL form. Active Record's vocabulary does not reach them directly, but raw SQL inside a .from or .find_by_sql does. The cost of dropping into SQL is one block of inline SQL in the code; the benefit is two orders of magnitude on real workloads.

Practice exercise

Find a controller in your app that fetches a large collection, groups by a column in Ruby, and returns the top N per group. Time it under realistic data.
Rewrite it as a SQL query with ROW_NUMBER(). Compare timings. On a 1M-row table, the difference is usually 5-50×.
Find a place in the app that computes a running total or running aggregate in Ruby (a balance, a streak count, a cumulative score). Rewrite with SUM(...) OVER (...).
If your data has any tree shape (categories, comments, organisational hierarchy), write a recursive CTE that returns one subtree. Confirm the EXPLAIN plan is sane (does not do a seq scan per level).

← Indexing beyond btree MVCC, vacuum, and bloat →