Friday, June 6, 2025

Avoiding disk spills due to PostgreSQL's logical replication

Logical replication is a versatile feature offered by PostgreSQL. I have discussed the theoretical background of this feature in detail in my POSETTE talk. At the end of the talk, I emphasize the need to monitor a logical replication setup. If you are using logical replication and have set up monitoring, you will be familiar with pg_stat_replication_slots. In some cases this view shows high values of spill_txns, spill_count and spill_bytes, which indicate that the WAL sender corresponding to that replication slot is using a large amount of disk space. This increases the load on the IO subsystem, affecting performance. It also leaves less disk space available for user data and regular transactions. All of this is an indication that logical_decoding_work_mem has been configured too low. That's the subject of this blog: how to decide the right value for logical_decoding_work_mem. Let's first discuss the purpose of this GUC; the talk might serve as good background before reading further.

Reorder buffer and logical_decoding_work_mem

When decoding WAL, a logical WAL sender accumulates transactions in an in-memory data structure called the reorder buffer. For every transaction the WAL sender encounters, it maintains a queue of the changes in that transaction. As it reads each WAL record, it finds the transaction ID to which the record belongs and adds the record to the corresponding queue of changes. As soon as it sees the COMMIT record of a transaction, it decodes all the changes in the corresponding queue and sends them downstream. If the reorder buffer fills up with transactions whose COMMIT records are yet to be seen, it spills the queues to disk. Such disk spills are accounted in spill_txns, spill_count and spill_bytes. The amount of memory allocated to the reorder buffer is decided by the logical_decoding_work_mem GUC. If the value is too low, it causes heavy disk spills; if it is too high, it wastes memory. Every WAL sender in the server will allocate logical_decoding_work_mem amount of memory, so the total memory consumed for maintaining reorder buffers is {number of WAL senders} * logical_decoding_work_mem, which can go up to max_wal_senders * logical_decoding_work_mem.

Setting logical_decoding_work_mem optimally

It's clear that the reorder buffer should be able to hold the WAL records of all concurrent transactions to avoid disk spills. How many concurrent transactions can there be? Every backend in PostgreSQL, client as well as worker, can potentially start a transaction, and only one transaction can be active at a given time in a given backend. Thus the upper bound on the number of concurrent transactions in a server is decided by: max_connections, which decides the maximum number of client backends in the server; max_prepared_transactions, which decides the number of prepared transactions in addition to the transactions in client backends; and max_worker_processes and autovacuum_max_workers, which together decide the backends other than client backends that may execute transactions. The sum of these GUCs gives the upper bound on the number of transactions that can be running concurrently in a server. Assuming the average amount of WAL produced by each transaction is known, the total amount of WAL that may get added to reorder buffers is {maximum number of concurrent transactions} * {average amount of WAL produced by each transaction}. The question is how to find that average.
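As a rough illustration, this upper bound and the resulting memory estimate can be computed directly from the GUCs above. The GUC values and the 1MB average below are made-up placeholders; substitute the values reported by SHOW on your own server, and measure the average as described next:

```python
# Placeholder GUC values; replace with the output of SHOW <guc> on your server.
max_connections = 100
max_prepared_transactions = 0
max_worker_processes = 8
autovacuum_max_workers = 3

# Upper bound on the number of concurrently running transactions.
max_concurrent_txns = (max_connections + max_prepared_transactions
                       + max_worker_processes + autovacuum_max_workers)

# Assumed average WAL per transaction (1MB here, purely illustrative);
# the options below discuss how to measure this from pg_waldump output.
avg_wal_per_txn = 1024 * 1024

# Worst-case WAL that may accumulate in the reorder buffer.
total_wal_bytes = max_concurrent_txns * avg_wal_per_txn
print(max_concurrent_txns, total_wal_bytes // (1024 * 1024))
```

With these placeholder values the bound is 111 concurrent transactions, i.e. about 111MB of WAL in the worst case.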

Transactions issued by different applications and transactions run by worker processes may all have different characteristics and thus produce different amounts of WAL. But they all compete for space in the reorder buffer, and they are all part of a single WAL stream, which can be examined with pg_waldump. There are a few ways we can use this utility to estimate logical_decoding_work_mem.
  1. Count the number of commits or aborts in a given set of WAL segments and divide the total size of those segments by that count. The total size of the WAL segments is {number of WAL segments} * {size of each WAL segment}. If you are seeing transactions being spilled to disk, the total amount of WAL generated by concurrent transactions is higher than logical_decoding_work_mem, which by default is 64MB, equivalent to 4 WAL segments of the default size, 16MB each. So you will need to analyze several WAL segments, not just a few.
  2. pg_waldump reports WAL records by transaction. It can be used for a better estimate by sampling typical transactions from its output and estimating the sizes and counts of each such typical transaction.
  3. Modify pg_waldump to keep a running count of the amount of WAL accumulated in the reorder buffer. The algorithm would look like this:
    1. T = 0
    2. Read a WAL record. If the record belongs to transaction x, Cx = Cx + size of WAL record, where Cx maintains the total size of WAL records of transaction x seen so far. If x is a new transaction, Cx = size of WAL record.
    3. T = T + size of WAL record, where T is the total size of WAL records accumulated in the reorder buffer when that record was read.
    4. When a COMMIT or ABORT WAL record of transaction x is read, T = T - Cx.

      This way T tracks the size of WAL records accumulated in the reorder buffer at any given point in time. The maximum over T can be used to estimate logical_decoding_work_mem.
  4. If you are not comfortable with C or pg_waldump's code, the above option can be implemented by parsing the output of pg_waldump in a higher-level language like Python.
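As a sketch of options 3 and 4, the running count above can be implemented by parsing pg_waldump's text output in Python. The regular expression below assumes the usual pg_waldump line layout (rmgr, len (rec/tot), tx and desc fields); the exact format can vary between PostgreSQL versions, so adjust it as needed. Records not attached to any transaction (tx: 0) are skipped, and, matching step 4 of the algorithm, the COMMIT/ABORT record itself is not added to Cx:

```python
import re

# Matches lines like:
# rmgr: Heap  len (rec/tot): 54/54, tx: 748, lsn: ..., prev ..., desc: INSERT ...
# Adjust if your pg_waldump version formats lines differently.
LINE_RE = re.compile(r"len \(rec/tot\):\s*\d+/\s*(\d+), tx:\s*(\d+),.*desc: (\w+)")

def peak_reorder_buffer(lines):
    """Return the peak total size T of WAL accumulated for in-progress
    transactions, following the running-count algorithm above."""
    per_txn = {}   # Cx: bytes accumulated per transaction id
    t = 0          # T: bytes currently held in the reorder buffer
    peak = 0
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        tot, xid, desc = int(m.group(1)), m.group(2), m.group(3)
        if xid == "0":              # record not tied to any transaction
            continue
        if desc in ("COMMIT", "ABORT"):
            t -= per_txn.pop(xid, 0)   # transaction leaves the buffer
        else:
            per_txn[xid] = per_txn.get(xid, 0) + tot
            t += tot
            peak = max(peak, t)
    return peak
```

In practice you would pipe the output of pg_waldump over several segments into this function, e.g. `peak_reorder_buffer(sys.stdin)`.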
Once you have estimated the maximum amount of WAL that may get accumulated in the reorder buffer, add about 5% overhead for the other data structures associated with the reorder buffer, and you have your first estimate of logical_decoding_work_mem. It can be refined further by setting the GUC and monitoring pg_stat_replication_slots.

However, remember that each WAL sender will consume logical_decoding_work_mem amount of memory, which may reduce the memory available for regular server operation. You may need to find an optimal value which leaves enough memory for regular server operation while reducing disk spills. Options 3 and 4 can help with that. If you plot the curve of T against time, you will find the memory consumed by the WAL senders in the steady state, eliminating any peaks or troughs in memory usage by logical decoding. logical_decoding_work_mem should be configured with this steady-state consumption in mind.
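One way to approximate that steady state without plotting is to record T after every record and take a high percentile of the series, which discards short-lived spikes. This is a sketch of that idea, not part of pg_waldump, and the percentile cut-off is an arbitrary choice you should tune:

```python
def steady_state_estimate(t_samples, pct=0.95):
    """Return the pct-th percentile of a T time series, ignoring the
    short-lived peaks that sit above the percentile."""
    if not t_samples:
        return 0
    s = sorted(t_samples)
    idx = min(len(s) - 1, int(pct * len(s)))
    return s[idx]
```

Feeding it the per-record values of T from option 3 or 4 gives a steady-state figure to weigh against the absolute maximum.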

If, even after doing all this, disk spills remain high or the WAL senders consume too much memory, your best bet is to stream in-progress transactions by specifying the streaming parameter of the logical replication protocol. Find more about that in this blog.

If you know other ways to estimate logical decoding memory usage or to avoid disk spills, please comment on this blog.