Re: Report reorder buffer size - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Report reorder buffer size
Date
Msg-id aJ24iItBLdGER2eD@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Report reorder buffer size  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List pgsql-hackers
Hi,

On Wed, Aug 13, 2025 at 06:09:47PM +0530, Ashutosh Bapat wrote:
> Hi All,
> The question of how to tune logical_decoding_work_mem has come up a
> few times in the context of customer cases.

Same here.

> If it is too low then the
> changes are spilled to disk, slowing down replication and consuming
> disk space. In case of streamed transactions, it won't consume space
> on primary but will consume resources (transactional resources/disk
> space) on replica. If decoding memory is too high, it may cause memory
> pressure especially when there are many wal senders and large amounts
> of changes queued to reorder buffers frequently.
> 
> In order to tune logical_decoding_work_mem optimally we need a measure
> of the size of changes belonging to the transactions queued in the
> reorder buffer, whether spilled to the disk or in memory or sent
> downstream as part of an unfinished streamed transaction.

Fully agree, that's something I started to work on some time ago. Not only
to tune the logical_decoding_work_mem but also to help diagnose issues with high
memory consumption and/or OOM due to logical decoding.

> Currently
> there is no way to determine that. pg_stat_replication_slots reports
> cumulative statistics about spilled or streamed transactions. It helps
> to know whether the logical decoding work memory is sufficient or not
> to hold all the changes. But it does not provide a way to estimate an
> optimal value for it since it doesn't capture the state of the reorder
> buffer at a given point in time.

Agree.

> With the attached WIP patch, we can report, at a given point in time,
> total size of reorder buffer i.e. sum of the size of all the changes
> belonging to the transactions active in the reorder buffer whether
> those changes are in memory, on disk or sent downstream as part of an
> unfinished streamed transaction. By sampling this value at regular
> intervals, one can observe the reorder buffer trendline as shown in
> the plots attached. In the plots, the Y-axis is the total size of
> reorder buffer and the X axis is time of day.
> rbsize_trendlines_pgbench.png shows trendline when just regular
> pgbench is run with 3K TPS and
> rbsize_trendlines_large_transaction_spikes.png shows the trendline
> when there are long and large transactions in addition to pgbench.

> Using these trendlines, a user may decide to keep
> logical_decoding_work_mem to minimum and thus let all the large
> transactions spill to disk or be streamed. 

Right. Also, the issue with spill files is that when they are read back from disk
then the reorder buffer can consume a lot of memory. That's why I think tracking
the xid, sub xid and LSNs could be useful too. That could help diagnose why
it was using that memory.

Attached is what I did prepare a couple of years ago. I did not share it until
now because it "only" logs information in the log file and I wanted to do more.

It's far from being polished and I'm not sure it's 100% correct. I share
it just as food for thought about what information I thought could be useful
to log. It applies on top of yours.

> Tracking total size of reorder buffer also helps to quantify load on a
> wal sender.

Yeah, and memory being used.

> It is an additional metric that can be useful to debug WAL
> buildup, slow replication etc. For example, the plot clearly and
> directly shows when the wal sender started processing the large
> transaction, when it finished etc.

Agree that's useful info to have.

> The patch is not complete. There are many things to be worked on

> 1. Whether the metrics should be reported in pg_stat_replication or
> pg_stat_replication_slots. Given that it's a point-in-time metric, it
> fits pg_stat_replication better.

I do think the same.

> But that view does not report logical
> decoding activity happening in backends other than the wal senders.
> But that's true with the other metrics in that view as well.
> pg_stat_replication_slots, however, covers all users of logical
> decoding. For now I have used pg_stat_replication, but it can be
> changed based on the discussion here.

I think pg_stat_replication is a good place to record "real time" activities.

Maybe we could log 2 metrics: the reorder buffer size before creating the spill
files and the reorder buffer size while reading back the spill files? That would
help make the distinction when logical_decoding_work_mem is involved (i.e before
creating the spill files) and when it is not (reading them back).

pg_stat_replication_slots could be used to add some counters too: like the
number of times logical_decoding_work_mem has been reached while decoding from
that slot. Also maybe record the max value the reorder buffer reached and 
the associated xid, number of sub xids, and LSN? That could be useful to diagnose
high memory consumption and/or OOM issues due to logical decoding.

Also, should we log some of information to the server log?

> But before I work on those, I would like to know others' opinions,
> usefulness and acceptance.

Thanks for working on it. I like the idea and also think that is an area where
more details/metrics should be provided.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Shubham Khanna
Date:
Subject: Re: Add support for specifying tables in pg_createsubscriber.
Next
From: Nazir Bilal Yavuz
Date:
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD