Re: Report reorder buffer size - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Report reorder buffer size |
Date | |
Msg-id | aJ24iItBLdGER2eD@ip-10-97-1-34.eu-west-3.compute.internal Whole thread Raw |
In response to | Report reorder buffer size (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
List | pgsql-hackers |
Hi, On Wed, Aug 13, 2025 at 06:09:47PM +0530, Ashutosh Bapat wrote: > Hi All, > The question of how to tune logical_decoding_work_mem has come up a > few times in the context of customer cases. Same here. > If it is too low then the > changes are spilled to disk, slowing down replication and consuming > disk space. In case of streamed transactions, it won't consume space > on primary but will consume resources (transactional resources/disk > space) on replica. If decoding memory is too high, it may cause memory > pressure especially when there are many wal senders and large amounts > of changes queued to reorder buffers frequently. > > In order to tune logical_decoding_work_mem optimally we need a measure > of the size of changes belonging to the transactions queued in the > reorder buffer, whether spilled to the disk or in memory or sent > downstream as part of an unfinished streamed transaction. Fully agree, that's something I started to work on some time ago. Not only to tune the logical_decoding_work_mem but also to help diagnose issues with high memory consumption and/or OOM due to logical decoding. > Currently > there is no way to determine that. pg_stat_replication_slots reports > cumulative statistics about spilled or streamed transactions. It helps > to know whether the logical decoding work memory is sufficient or not > to hold all the changes. But it does not provide a way to estimate an > optimal value for it since it doesn't capture the state of the reorder > buffer at a given point in time. Agree. > With the attached WIP patch, we can report, at a given point in time, > total size of reorder buffer i.e. sum of the size of all the changes > belonging to the transactions active in the reorder buffer whether > those changes are in memory, on disk or sent downstream as part of an > unfinished streamed transaction. By sampling this value at regular > intervals, one can observe the reorder buffer trendline as shown in > the plots attached. In the plots, the Y-axis is the total size of > reorder buffer and the X axis is time of day. > rbsize_trendlines_pgbench.png shows trendline when just regular > pgbench is run with 3K TPS and > rbsize_trendlines_large_transaction_spikes.png shows the trendline > when there are long and large transactions in addition to pgbench. > Using these trendlines, a user may decide to keep > logical_decoding_work_mem to minimum and thus let all the large > transactions spill to disk or be streamed. Right. Also, the issue with spill files is that when they are read back from disk then the reorder buffer can consume a lot of memory. That's why I think tracking the xid, sub xid and LSNs could be useful too. That could help diagnose why it was using that memory. Attached is what I did prepare a couple of years ago. I did not share it until now because it "only" logs information in the log file and I wanted to do more. It's far from being polished and I'm not sure it's 100% correct. I share it just as food for thought about what information I thought could be useful to log. It applies on top of yours. > Tracking total size of reorder buffer also helps to quantify load on a > wal sender. Yeah, and memory being used. > It is an additional metric that can be useful to debug WAL > buildup, slow replication etc. For example, the plot clearly and > directly shows when the wal sender started processing the large > transaction, when it finished etc. Agree that's useful info to have. > The patch is not complete. There are many things to be worked on > 1. Whether the metrics should be reported in pg_stat_replication or > pg_stat_replication_slots. Given that it's a point-in-time metric, it > fits pg_stat_replication better. I do think the same. > But that view does not report logical > decoding activity happening in backends other than the wal senders. > But that's true with the other metrics in that view as well. > pg_stat_replication_slots, however, covers all users of logical > decoding. For now I have used pg_stat_replication, but it can be > changed based on the discussion here. I think pg_stat_replication is a good place to record "real time" activities. Maybe we could log 2 metrics: the reorder buffer size before creating the spill files and the reorder buffer size while reading back the spill files? That would help make the distinction when logical_decoding_work_mem is involved (i.e before creating the spill files) and when it is not (reading them back). pg_stat_replication_slots could be used to add some counters too: like the number of times logical_decoding_work_mem has been reached while decoding from that slot. Also maybe record the max value the reorder buffer reached and the associated xid, number of sub xids, and LSN? That could be useful to diagnose high memory consumption and/or OOM issues due to logical decoding. Also, should we log some of information to the server log? > But before I work on those, I would like to know others' opinions, > usefulness and acceptance. Thanks for working on it. I like the idea and also think that is an area where more details/metrics should be provided. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: