
Scaling Series · 8 of 8

Bulk Operations & Memory

Where Ruby's memory model breaks down at scale, why find_each beats .each on tables you care about, and how Shopify's jemalloc work points at the deeper truth that ActiveRecord is the wrong tool for bulk work.

Where this rule comes from

ActiveRecord was designed for the common case of CRUD requests: find one user, update a few attributes, save. Every record is materialized as a Ruby object. The cost of that materialization (allocating a hash, parsing column values, running type casts, initializing instance variables) is invisible at one record. At a million records, it dominates everything else the application is doing.

Two specific phenomena meet here. The first is Ruby's per-object memory cost: every ActiveRecord instance lives in the Ruby heap, uses an object slot, and contributes to garbage-collection pressure. Loading a million Order records means a million allocated objects, typically multiple gigabytes of memory, and a long GC pause when Ruby tries to clean them up.

The second is Ruby's allocator behavior under thread pressure. Default glibc malloc behaves badly when many threads allocate simultaneously, fragmenting memory and rarely returning it to the OS. Shopify's engineering team identified this as a major source of "memory keeps growing" complaints in Puma-with-threads deployments. Their fix, jemalloc, is now standard practice for high-traffic Rails apps; the Rails 8 deploy templates ship with jemalloc enabled by default.

The senior rule for bulk work: ActiveRecord is for individual records, not batches. When you are processing thousands or millions of rows (exports, imports, recomputations, migrations), ActiveRecord's per-record costs become the bottleneck. The fix is either to batch-iterate carefully (find_each), to skip materialization entirely (update_all), or to drop down to raw SQL.

The anti-pattern

Picture a Rails app where the team builds a "send all users a notification" feature. The first version is the obvious one:

class SendDailyDigestJob < ApplicationJob
  def perform
    User.where(daily_digest: true).each do |user|
      DailyDigestMailer.daily_digest(user).deliver_later
    end
  end
end

This works for the first 10,000 users. At 500,000 subscribed users, the job blows past the worker dyno's memory quota (Heroku's R14 "memory quota exceeded" warnings) within sixty seconds and dies.

Three things went wrong:

  • User.where(...).each loads every match into memory before iterating. The SQL is a single SELECT; ActiveRecord materializes every returned row into a User object; then the .each begins. For 500,000 users, that is ~500MB of Ruby objects before the first email is queued.
  • Garbage collection cannot recover the memory mid-iteration. The User objects are all reachable through the loaded relation that .where returned. GC cannot free any of them until the entire result set goes out of scope.
  • The 500,000 mailer jobs all enqueue at once. Even after the loop finishes, the queue tier now has half a million pending jobs. Sidekiq or Solid Queue handle this fine, but the spike in queue depth can cascade: workers, database connections, and mail-sending rate limits all get hit at once.

find_each and in_batches

The Rails fix has been in the framework since 2.3: find_each. Instead of one giant SELECT, it issues a series of batched queries keyed on the primary key, a thousand records at a time by default, and yields each record. The Ruby heap holds only the current batch, not the whole result set.

class SendDailyDigestJob < ApplicationJob
  def perform
    User.where(daily_digest: true).find_each(batch_size: 500) do |user|
      DailyDigestMailer.daily_digest(user).deliver_later
    end
  end
end

# Underlying behavior:
#   batch 1: SELECT ... ORDER BY id LIMIT 500
#   batch 2: SELECT ... WHERE id > [last] ORDER BY id LIMIT 500
#   batch 3: SELECT ... WHERE id > [last] ORDER BY id LIMIT 500
#   ... etc.
#
# Memory cost: ~500 user objects in scope at any time.
# Earlier batches become GC-eligible as soon as the next batch starts.

The default batch_size of 1000 is fine for most use cases. For very wide tables (rows with many large columns), drop it to 100-500. For very narrow tables, raise it to 5000. The metric to watch is the worker's memory growth between batches: if memory plateaus, the batch size is fine.
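
One low-tech way to watch that number, as a sketch: a small rss_mb helper (hypothetical, shelling out to ps, so Unix-like hosts only; the get_process_mem gem is a portable alternative) printed once per batch. A plateau after the first few batches means the batch size is fine.

# Hypothetical helper: current process RSS in megabytes, read via `ps`.
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

processed = 0
User.where(daily_digest: true).find_each(batch_size: 500) do |user|
  DailyDigestMailer.daily_digest(user).deliver_later
  processed += 1
  # Log memory once per batch, not per record.
  puts "#{processed} users processed, RSS: #{rss_mb} MB" if (processed % 500).zero?
end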

For operations on whole batches at a time, use in_batches, which yields an ActiveRecord::Relation instead of individual records, so each batch can be handled with a single bulk statement:

# Mark a million users as inactive in batches:
User.where(last_seen_at: ...30.days.ago).in_batches(of: 5_000) do |batch|
  batch.update_all(active: false, deactivated_at: Time.current)
end

# Each batch becomes one UPDATE statement on 5,000 rows.
# No materialization, no callbacks, no validations.
# About 100x faster than .find_each + .update! per record.

update_all and delete_all: the bypass valves

Sometimes you do not need ActiveRecord objects at all. You have a bulk update or delete, and the callbacks and validations are not relevant (or actively in the way). update_all and delete_all drop straight to SQL:

# Update every matching row in one SQL statement.
# No model instantiation, no callbacks, no validations.
Order.where(status: "pending", created_at: ...30.days.ago)
     .update_all(status: "expired", expired_at: Time.current)

# Same for deletes. delete_all skips ActiveRecord destroy callbacks;
# destroy_all runs them per record.
StalePostView.where("viewed_at < ?", 1.year.ago).delete_all

The tradeoff is real and worth being explicit about: skipping callbacks means skipping audit logs, search reindexing, denormalized counter updates, and anything else that lives in after_update. If those side effects need to happen, you cannot use update_all; you need find_each with the per-record callbacks. The decision is "what side effects matter for this batch?" Not every bulk update has side effects worth running.
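
When the side effects do matter, the slower shape is the per-record one; a sketch reusing the expiration example from above:

# Same expiration as the update_all above, but per record, so after_update
# callbacks (audit rows, reindexing, counter updates) still fire.
Order.where(status: "pending", created_at: ...30.days.ago)
     .find_each(batch_size: 1_000) do |order|
  order.update!(status: "expired", expired_at: Time.current)
end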

For audit-trail-shaped side effects, a common pattern is to emit a single audit event for the bulk operation, not one per record. The senior version of this is: BulkExpirationAudit.create!(record_ids: order_ids, expired_at: ...) after the update_all. One row per bulk operation is a fundamentally different correctness model than one row per affected record, but for most audit purposes it is the right model.
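
A sketch of that pattern, with BulkExpirationAudit standing in as a hypothetical audit model:

# Capture the affected ids first, because update_all changes which rows the
# scope matches. One audit row records the whole bulk operation.
scope     = Order.where(status: "pending", created_at: ...30.days.ago)
order_ids = scope.pluck(:id)

Order.transaction do
  scope.update_all(status: "expired", expired_at: Time.current)
  BulkExpirationAudit.create!(record_ids: order_ids, expired_at: Time.current)
end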

insert_all and upsert_all

Rails 6 added insert_all and upsert_all, which perform bulk inserts in one SQL statement. For importing CSVs or data feeds, the speedup over .create! in a loop is typically 100x or more:

# Slow: one INSERT per record, all wrapped in callbacks/validations.
csv_rows.each do |row|
  Product.create!(sku: row["sku"], name: row["name"], price: row["price"])
end
# 10,000 rows: ~2 minutes, with 10,000 callbacks fired

# Fast: one INSERT statement, no callbacks, no validations.
Product.insert_all(csv_rows.map { |r|
  {
    sku:        r["sku"],
    name:       r["name"],
    price:      r["price"],
    created_at: Time.current,
    updated_at: Time.current
  }
})
# 10,000 rows: ~1 second, atomic, transactional

# Upsert: insert or update on conflict.
Product.upsert_all(
  csv_rows.map { |r| ... },
  unique_by: :sku  # uses the unique index on sku
)

Two operational gotchas: insert_all does not populate created_at/updated_at for you (you have to pass them explicitly), and it does not run validations. For data you generated yourself or pulled from a trusted source, that is fine; for user-submitted data, run validations first or accept the risk.
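
If the data is user-submitted and the validations matter, one sketch is to validate in memory first and bulk-insert only the rows that pass (note that uniqueness validations still cost one query per row):

# Build the attribute hashes once, validate without saving, then bulk-insert.
now  = Time.current
rows = csv_rows.map do |r|
  { sku: r["sku"], name: r["name"], price: r["price"], created_at: now, updated_at: now }
end

valid_rows, invalid_rows = rows.partition { |attrs| Product.new(attrs).valid? }

Product.insert_all(valid_rows) if valid_rows.any?
Rails.logger.warn("Import skipped #{invalid_rows.size} invalid rows") if invalid_rows.any?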

Ruby memory: jemalloc and MALLOC_ARENA_MAX

After you have applied the ActiveRecord-level fixes (find_each, update_all, bulk inserts), there is a deeper layer of memory tuning that comes from how Ruby itself allocates memory.

Ruby on Linux uses glibc's malloc by default. Under threaded workloads (Puma with threads, Sidekiq with concurrency), glibc creates additional memory "arenas" to reduce lock contention: whenever a thread finds the existing arenas busy, it can get a new one, up to a default cap of 8 × the number of cores. The arenas are not shared, so each one allocates from its own pool, and memory is rarely returned to the OS, even after objects are garbage-collected. Process RSS climbs and climbs.

Two well-known fixes:

1. MALLOC_ARENA_MAX=2. Restrict glibc to two arenas total. Per-thread allocation contention goes up slightly, but memory growth slows dramatically. This is the cheapest fix: one environment variable, no code change (a one-line sketch follows the Dockerfile snippet below). Heroku and Render apply this by default in many of their templates.

2. jemalloc. Replace glibc malloc with jemalloc, a different allocator that handles threaded workloads better. Shopify documented their migration to jemalloc and saw a 10-30% reduction in memory usage and noticeably better latency consistency. Rails 8's deploy templates ship jemalloc enabled by default in the Dockerfile.

# Dockerfile (Rails 8 default)
RUN apt-get update -qq && apt-get install --no-install-recommends -y libjemalloc2
ENV LD_PRELOAD=libjemalloc.so.2

# Result: process RSS grows more slowly under load,
# and returns memory to the OS more aggressively.
# No application code changes required.
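
If you take the MALLOC_ARENA_MAX route instead, it is a single environment variable; a sketch for a Dockerfile or Procfile-based deploy:

# Dockerfile: cap glibc at two arenas (has no effect once jemalloc is preloaded,
# since MALLOC_ARENA_MAX is a glibc tunable).
ENV MALLOC_ARENA_MAX=2

# Or outside Docker, set it where the process is launched, e.g. in a Procfile:
#   worker: MALLOC_ARENA_MAX=2 bundle exec sidekiq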

These are background-level optimizations: they do not fix application bugs, but they make memory-hungry application code much more forgiving. For most apps, the order is: fix the bulk-operation code first (the application change), then enable jemalloc (the infrastructure change), then start watching for the deeper issues.

When ActiveRecord is the wrong tool

For some operations, even find_each is too slow. The cost of materializing objects is unavoidable as long as you go through ActiveRecord. For maintenance scripts, complex aggregations, or large-scale data transformations, raw SQL via ActiveRecord::Base.connection.execute is the right answer:

# Recompute denormalized counters across the whole table.
# This would take ~30 minutes with .find_each + counter update.
# Raw SQL: ~30 seconds.
ActiveRecord::Base.connection.execute(<<~SQL)
  UPDATE users
  SET    posts_count = (
    SELECT COUNT(*) FROM posts WHERE posts.user_id = users.id
  ),
         updated_at  = NOW()
  WHERE  users.posts_count IS NULL
SQL

The cost is real: no ActiveRecord callbacks, no validations, no fancy Rails error handling. You are talking to the database directly. For maintenance operations on data you control, that is usually the right tradeoff.
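
One small refinement worth knowing (a sketch, not specific to this script): exec_update returns the number of affected rows, which makes a maintenance script's output more useful than a bare execute.

affected = ActiveRecord::Base.connection.exec_update(<<~SQL)
  UPDATE users
  SET    posts_count = (
    SELECT COUNT(*) FROM posts WHERE posts.user_id = users.id
  ),
         updated_at  = NOW()
  WHERE  users.posts_count IS NULL
SQL

Rails.logger.info("Backfilled posts_count for #{affected} users")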

The senior heuristic: if a bulk operation will run on more than ~100k records, prototype it in raw SQL first and see how fast it is. Often the raw-SQL version is fast enough that you do not need ActiveRecord's batching at all. The database is much faster at bulk operations than Ruby is; using ActiveRecord for them is leaving performance on the table.

What real teams have written

Shopify's engineering blog has multiple deep posts on Ruby memory at scale. Their migration to jemalloc is documented in detail; their work on MALLOC_ARENA_MAX tuning, their use of compaction in Ruby 3.x, and their internal SamplingProfiler tooling for production memory profiling are all worth reading. The recurring theme is that memory is the limiting factor in their Rails fleet far more often than CPU is, and the optimizations that matter are usually allocation reductions in hot paths.

GitHub Engineering's posts on running Ruby at scale focus heavily on memory and process lifecycle. They migrated to jemalloc years before it was a Rails default. Their tooling for answering "what is keeping this memory alive?" (heap dumps analyzed offline) is the kind of work very few teams need, but reading how they approach the question is instructive even for teams operating at much smaller scale.

Artsy Engineering's posts on Rails import/export work consistently document the journey from "the obvious .each + .create! code" through "find_each made it survive memory limits" to "raw SQL made it finish in minutes instead of hours." That progression is the canonical Rails-bulk-work learning curve.

When bulk-tuning is overkill

  • Operations on small collections. If you are iterating 100 records to send 100 emails, .each is fine. The bulk-optimization patterns earn their cost only when there are thousands of records.
  • When callbacks must run. Audit trails per record, search reindexing per record, cache invalidation per record. update_all skips all of these. If they are real correctness requirements, you have to use find_each, accept the slower pace, and probably batch the work across multiple jobs.
  • When the operation is rare enough. A monthly maintenance script that takes 10 minutes to finish is fine. Optimizing it to take 30 seconds adds complexity that is not paying for itself.

The senior heuristic for bulk work: if a job is processing more than ~10k records, default to find_each. If it is processing more than ~100k, ask whether update_all or insert_all would do. If it is more than a million and the callbacks are not required, drop to raw SQL. The right tool depends on the scale and the side-effect requirements; ActiveRecord is rarely the right tool at the top of that scale.

The principle at play

ActiveRecord is an Object-Relational Mapper, with emphasis on "Object." Its job is to make individual records feel like Ruby objects, complete with callbacks, validations, and a rich method API. That model is enormously productive at the per-record level and enormously expensive at the per-batch level. The cost of one materialization is invisible; the cost of a million is the bottleneck.

The deeper move is recognizing that the database is much better at bulk operations than Ruby is. SQL is designed for set operations: "update every row where X." Ruby is designed for individual transformations: "for each row, do Y." Aligning the operation with the tool (set operations to SQL, individual transformations to Ruby) is the structural insight. The bridge tools (update_all, insert_all) are Rails' way of letting you stay in Ruby while using the database's bulk-native path.

The pragmatic value: most "this script takes hours" reports resolve into "this script materializes a million ActiveRecord objects and updates them one by one." Knowing the alternatives (find_each for memory, update_all for speed, raw SQL for the largest cases) is what makes the same scripts finish in minutes. The hardware did not get faster; the code got closer to what the database was designed to do.

Practice exercise

  1. Grep your codebase for \.each do right after a .where: grep -rn "\\.where.*\\.each" app/. For each match, ask whether the result set is bounded. If not, it is a future memory bomb.
  2. Pick your largest table. Find one rake task or job that iterates over it. Add a memory probe that prints the process RSS before and after (the rss_mb sketch in the find_each section above is one way). Notice the delta.
  3. Look for places in your code that do bulk updates: a loop over records updating each one. Try replacing with update_all. Benchmark before and after.
  4. If your app does CSV imports, replace any create!-per-row loops with insert_all on batches of 1,000. Time the difference.
  5. Bonus: check whether your production Dockerfile installs jemalloc and sets LD_PRELOAD. If not, the relevant lines from the Rails 8 default Dockerfile are a quick copy-paste that will likely reduce memory growth in your fleet immediately.
  6. Bonus 2: read Shopify's posts on Ruby allocation profiling. Their internal SamplingProfiler approach is overkill for most teams, but the framing of "which lines of code are allocating which objects" is the senior mental model for memory work.

Closing the Scaling series

Across eight lessons, this series has worked through the canonical scaling moves in a Rails app's life: diagnose with measurement before changing anything; add indexes when the database is doing too much work; tune the concurrency model so threads and processes match the workload; design around hot rows when concurrent writes contend; cache only after the underlying code is correct; move work to background jobs to free the request cycle; switch to cursor pagination before offsets collapse; use bulk operations when ActiveRecord becomes the bottleneck.

Each lesson was a different bottleneck, found by measurement, fixed by a specific Rails-shaped tool. The order matters: most apps go through them in roughly this sequence as they grow. A team that jumps ahead (sharding before indexing, caching before measuring) usually finds itself having to come back and do the earlier work later, after the architectural detour has cost time and complexity.

The deeper insight across all eight: "scaling Rails" is mostly not about new infrastructure or new architecture. It is about understanding the Rails primitives well enough to know when each one is reaching its limit, and reaching for the next primitive deliberately. The teams that scale Rails to enormous workloads (Shopify, GitHub, Basecamp, Artsy, Mastodon) all do this work in roughly this sequence, with roughly these tools, and write about it publicly because the patterns generalize.

The senior skill, finally, is recognizing the next bottleneck before it shows up in PagerDuty. Each lesson in this series is a thing to watch for in your APM: slow query at p99, thread pool nearing capacity, lock waits in pg_stat_activity, cache stampedes on the hour, queue depth climbing, pagination latency growing with page number, memory creeping up between deploys. Knowing which signal corresponds to which lesson, and which lesson's fix is the right one, is the actual senior craft. Everything else is plumbing.