Thread: [BUG]: the walsender does not update its IO statistics until it exits

[BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

25 February, 16:42:08

Hi hackers,

while doing some tests for [1], I observed that $SUBJECT.

To observe this behavior on master:

1. create a logical replication slot

postgres=# SELECT * FROM pg_create_logical_replication_slot('logical_slot', 'test_decoding', false, true);
  slot_name   |    lsn
--------------+------------
 logical_slot | 0/40749508
(1 row)

2. create a table and add some data

postgres=# create table bdt (a int);
CREATE TABLE
postgres=# insert into bdt select a from generate_series(1,10000) a ;
INSERT 0 10000

3. starts pg_recvlogical that way

pg_recvlogical -d postgres -S logical_slot -f - --no-loop --start

4. query pg_stat_io

postgres=# select backend_type,object,context,reads,read_bytes from pg_stat_io where backend_type = 'walsender';
 backend_type |    object     |  context  | reads | read_bytes
--------------+---------------+-----------+-------+------------
 walsender    | relation      | bulkread  |     0 |          0
 walsender    | relation      | bulkwrite |     0 |          0
 walsender    | relation      | init      |     0 |          0
 walsender    | relation      | normal    |     6 |      49152
 walsender    | relation      | vacuum    |     0 |          0
 walsender    | temp relation | normal    |     0 |          0
 walsender    | wal           | init      |       |
 walsender    | wal           | normal    |     0 |          0
(8 rows)

The non zeros stats that we see here are due to the pgstat_report_stat() call in
PostgresMain() but not to the walsender decoding activity per say (proof is that
you can see that the wal object values are empty while it certainly had to read
some WAL).

5. Once ctrl-c is done for pg_recvlogical then we get:

postgres=# select backend_type,object,context,reads,read_bytes from pg_stat_io where backend_type = 'walsender';
 backend_type |    object     |  context  | reads | read_bytes
--------------+---------------+-----------+-------+------------
 walsender    | relation      | bulkread  |     0 |          0
 walsender    | relation      | bulkwrite |     0 |          0
 walsender    | relation      | init      |     0 |          0
 walsender    | relation      | normal    |     9 |      73728
 walsender    | relation      | vacuum    |     0 |          0
 walsender    | temp relation | normal    |     0 |          0
 walsender    | wal           | init      |       |
 walsender    | wal           | normal    |    98 |     793856
(8 rows)

Now we can see that the numbers increased for the relation object and that we
get non zeros numbers for the wal object too (which makes fully sense).

With the attached patch applied, we would get the same numbers already in
step 4. (means the stats are flushed without the need to wait for the walsender
to exit).

Remarks:

R1. The extra flush are done in WalSndLoop(): I believe this is the right place
for them.
R2. A test is added in 035_standby_logical_decoding.pl: while this TAP test
is already racy ([2]) that looks like a good place as we don't want pg_recvlogical
to stop/exit.
R3. The test can also be back-patched till 16_STABLE as 035_standby_logical_decoding.pl
has been introduced in 16 (and so do pg_stat_io).
R4. The test fails if the extra flushs are not applied/done, which makes fully
sense.

[1]: https://www.postgresql.org/message-id/flat/Z3zqc4o09dM/Ezyz%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://www.postgresql.org/message-id/Z6oRgmD8m7zBo732%40ip-10-97-1-34.eu-west-3.compute.internal

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v1-0001-Flush-the-IO-statistics-of-active-walsenders.patch

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

26 February, 09:37:10

On Tue, Feb 25, 2025 at 01:42:08PM +0000, Bertrand Drouvot wrote:
> Now we can see that the numbers increased for the relation object and that we
> get non zeros numbers for the wal object too (which makes fully sense).
>
> With the attached patch applied, we would get the same numbers already in
> step 4. (means the stats are flushed without the need to wait for the walsender
> to exit).

@@ -2793,6 +2794,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
         if (pq_flush_if_writable() != 0)
             WalSndShutdown();

+        /*
+         * Report IO statistics
+         */
+        pgstat_flush_io(false);
+        (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
+
         /* If nothing remains to be sent right now ... */
         if (WalSndCaughtUp && !pq_is_send_pending())
         {

That's bad, worse for a logical WAL sender, because it means that we
have no idea what kind of I/O happens in this process until it exits,
and logical WAL senders could loop forever, since v16 where we've
begun tracking I/O.

A non-forced periodic flush like you are proposing here sounds OK to
me, but the position of the flush could be positioned better in the
loop.  If there is a SIGUSR2 (aka got_SIGUSR2 is true), WAL senders
would shut down, so it seems rather pointless to do a flush just
before exiting the process in WalSndDone(), no?  I'd suggest to move
the flush attempt closer to where we wait for some activity, just
after WalSndKeepaliveIfNecessary().
--
Michael

Attachment

signature.asc

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

26 February, 12:48:50

Hi,

On Wed, Feb 26, 2025 at 03:37:10PM +0900, Michael Paquier wrote:
> On Tue, Feb 25, 2025 at 01:42:08PM +0000, Bertrand Drouvot wrote:
> > Now we can see that the numbers increased for the relation object and that we
> > get non zeros numbers for the wal object too (which makes fully sense).
> > 
> > With the attached patch applied, we would get the same numbers already in
> > step 4. (means the stats are flushed without the need to wait for the walsender
> > to exit).
> 
> @@ -2793,6 +2794,12 @@ WalSndLoop(WalSndSendDataCallback send_data)
>          if (pq_flush_if_writable() != 0)
>              WalSndShutdown();
>  
> +        /*
> +         * Report IO statistics
> +         */
> +        pgstat_flush_io(false);
> +        (void) pgstat_flush_backend(false, PGSTAT_BACKEND_FLUSH_IO);
> +
>          /* If nothing remains to be sent right now ... */
>          if (WalSndCaughtUp && !pq_is_send_pending())
>          {
> 
> That's bad, worse for a logical WAL sender, because it means that we
> have no idea what kind of I/O happens in this process until it exits,
> and logical WAL senders could loop forever, since v16 where we've
> begun tracking I/O.

Yeah... And while the example shared up-thread is related to logical walsender,
the same issue exists for a physical walsender.

OTOH, It's also great to see that the new stats that have been added (the WAL
ones) helped to spot the issue.

> A non-forced periodic flush like you are proposing here sounds OK to
> me,

Thanks for looking at it!

> but the position of the flush could be positioned better in the
> loop.  If there is a SIGUSR2 (aka got_SIGUSR2 is true), WAL senders
> would shut down,

That's true for a physical walsender but I'm not sure it is for a logical
walsender (due to the "sentPtr == replicatedPtr" check in WalSndDone()).

> so it seems rather pointless to do a flush just
> before exiting the process in WalSndDone(), no?  I'd suggest to move
> the flush attempt closer to where we wait for some activity, just
> after WalSndKeepaliveIfNecessary().

Yeah I think that makes sense, done that way in the attached.

Speaking about physical walsender, I moved the test to 001_stream_rep.pl instead
(would also fail without the fix).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v2-0001-Flush-the-IO-statistics-of-active-walsenders.patch

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Andres Freund

Date:

26 February, 13:08:17

Hi,

On 2025-02-26 15:37:10 +0900, Michael Paquier wrote:
> That's bad, worse for a logical WAL sender, because it means that we
> have no idea what kind of I/O happens in this process until it exits,
> and logical WAL senders could loop forever, since v16 where we've
> begun tracking I/O.

FWIW, I think medium term we need to work on splitting stats flushing into two
separate kinds of flushes:
1) non-transactional stats, which should be flushed at a regular interval,
   unless a process is completely idle
2) transaction stats, which can only be flushed at transaction boundaries,
   because before the transaction boundary we don't know if e.g. newly
   inserted rows should be counted as live or dead

So far we have some timer logic for 2), but we have basically no support for
1). Which means we have weird ad-hoc logic in various kinds of
non-plain-connection processes. And that will often have holes, as Bertrand
noticed here.

I think it's also bad that we don't have a solution for 1), even just for
normal connections. If a backend causes a lot of IO we might want to know
about that long before the longrunning transaction commits.

I suspect the right design here would be to have a generalized form of the
timeout mechanism we have for 2).

For that we'd need to make sure that pgstat_report_stat() can be safely called
inside a transaction.  The second part would be to redesign the
IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
idleness, without introducing unacceptable overhead - I think that's doable.

Greetings,

Andres Freund

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

27 February, 06:09:46

On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:
> I think it's also bad that we don't have a solution for 1), even just for
> normal connections. If a backend causes a lot of IO we might want to know
> about that long before the longrunning transaction commits.
>
> I suspect the right design here would be to have a generalized form of the
> timeout mechanism we have for 2).
>
> For that we'd need to make sure that pgstat_report_stat() can be safely called
> inside a transaction.  The second part would be to redesign the
> IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
> idleness, without introducing unacceptable overhead - I think that's doable.

Agreed that something in the lines of non-transaction update of the
entries could be adapted in some cases, so +1 for the idea.  I suspect
that there would be cases where a single stats kind should be able to
handle both transactional and non-transactional flush cases.
Splitting that across two stats kinds would lead to a lot of
duplication.
--
Michael

Attachment

signature.asc

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

27 February, 08:51:57

Hi,

On Wed, Feb 26, 2025 at 05:08:17AM -0500, Andres Freund wrote:
> Hi,
> 
> On 2025-02-26 15:37:10 +0900, Michael Paquier wrote:
> > That's bad, worse for a logical WAL sender, because it means that we
> > have no idea what kind of I/O happens in this process until it exits,
> > and logical WAL senders could loop forever, since v16 where we've
> > begun tracking I/O.
> 
> FWIW, I think medium term we need to work on splitting stats flushing into two
> separate kinds of flushes:
> 1) non-transactional stats, which should be flushed at a regular interval,
>    unless a process is completely idle
> 2) transaction stats, which can only be flushed at transaction boundaries,
>    because before the transaction boundary we don't know if e.g. newly
>    inserted rows should be counted as live or dead
> 
> So far we have some timer logic for 2), but we have basically no support for
> 1). Which means we have weird ad-hoc logic in various kinds of
> non-plain-connection processes. And that will often have holes, as Bertrand
> noticed here.

Thanks for sharing your thoughts on it!

Yeah, agree that's a good medium term idea to avoid missing flushing some
stats.

> I think it's also bad that we don't have a solution for 1), even just for
> normal connections. If a backend causes a lot of IO we might want to know
> about that long before the longrunning transaction commits.

+++1 There is no need to wait for the transaction boundary for some stats as
those would be more "useful"/"actionable" if flushed during the transaction is
in progress.

> I suspect the right design here would be to have a generalized form of the
> timeout mechanism we have for 2).
> 
> For that we'd need to make sure that pgstat_report_stat() can be safely called
> inside a transaction.  The second part would be to redesign the
> IdleStatsUpdateTimeoutPending mechanism so it is triggered independent of
> idleness, without introducing unacceptable overhead - I think that's doable.

Adding this to my bucket unless someone beats me on it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Andres Freund

Date:

27 February, 13:06:15

On 2025-02-27 12:09:46 +0900, Michael Paquier wrote:
> I suspect that there would be cases where a single stats kind should be able
> to handle both transactional and non-transactional flush cases.  Splitting
> that across two stats kinds would lead to a lot of duplication.

Agreed. Table stats will be the prime case where that's important.

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

28 February, 08:41:34

On Wed, Feb 26, 2025 at 09:48:50AM +0000, Bertrand Drouvot wrote:
> Yeah I think that makes sense, done that way in the attached.
>
> Speaking about physical walsender, I moved the test to 001_stream_rep.pl instead
> (would also fail without the fix).

Hmm.  I was doing some more checks with this patch, and on closer look
I am wondering if the location you have chosen for the stats reports
is too aggressive: this requires a LWLock for the WAL sender backend
type taken in exclusive mode, with each step of WalSndLoop() taken
roughly each time a record or a batch of records is sent.  A single
installcheck with a primary/standby setup can lead to up to 50k stats
report calls.

With smaller records, the loop can become hotter, can't it?  Also,
there can be a high number of WAL senders on a single node, and I've
heard of some customers with complex logical decoding deployments with
dozens of logical WAL senders.  Isn't there a risk of having this code
path become a point of contention?  It seems to me that we should
benchmark this change more carefully, perhaps even reduce the
frequency of the report calls.
--
Michael

Attachment

signature.asc

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

28 February, 11:44:20

On Fri, Feb 28, 2025 at 02:41:34PM +0900, Michael Paquier wrote:
> With smaller records, the loop can become hotter, can't it?  Also,
> there can be a high number of WAL senders on a single node, and I've
> heard of some customers with complex logical decoding deployments with
> dozens of logical WAL senders.  Isn't there a risk of having this code
> path become a point of contention?  It seems to me that we should
> benchmark this change more carefully, perhaps even reduce the
> frequency of the report calls.

One idea here would be to have on a single host one server with a set
of N pg_receivewal processes dumping their WAL segments into a tmpfs,
while a single session generates a bunch of records with a minimal
size using pg_logical_emit_message().  Monitoring the maximum
replication lag with pg_stat_replication and looking at some perf
profiles of the cluster should show how these stats reports affect the
replication setup efficiency.
--
Michael

Attachment

signature.asc

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

28 February, 13:39:31

Hi,

On Fri, Feb 28, 2025 at 02:41:34PM +0900, Michael Paquier wrote:
> On Wed, Feb 26, 2025 at 09:48:50AM +0000, Bertrand Drouvot wrote:
> > Yeah I think that makes sense, done that way in the attached.
> > 
> > Speaking about physical walsender, I moved the test to 001_stream_rep.pl instead
> > (would also fail without the fix).
> 
> Hmm.  I was doing some more checks with this patch, and on closer look
> I am wondering if the location you have chosen for the stats reports
> is too aggressive: this requires a LWLock for the WAL sender backend
> type taken in exclusive mode, with each step of WalSndLoop() taken
> roughly each time a record or a batch of records is sent.  A single
> installcheck with a primary/standby setup can lead to up to 50k stats
> report calls.

Yeah, what I can observe (installcheck with a primary/standby):

- extras flush are called about 70K times.

Among those 70K:

- about 575 are going after the "have_iostats" check in pgstat_io_flush_cb()
- about 575 are going after the PendingBackendStats pg_memory_is_all_zeros()
check in pgstat_flush_backend() (makes sense due to the above)
- about 575 are going after the PendingBackendStats.pending_io pg_memory_is_all_zeros()
check in pgstat_flush_backend_entry_io() (makes sense due to the above)

It means that only a very few of them are "really" flushing IO stats.

> With smaller records, the loop can become hotter, can't it?  Also,
> there can be a high number of WAL senders on a single node, and I've
> heard of some customers with complex logical decoding deployments with
> dozens of logical WAL senders.  Isn't there a risk of having this code
> path become a point of contention?  It seems to me that we should
> benchmark this change more carefully, perhaps even reduce the
> frequency of the report calls.

That sounds a good idea to measure the impact of those extra calls and see
if we'd need to mitigate the impacts. I'll collect some data.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

03 March, 04:51:19

On Fri, Feb 28, 2025 at 10:39:31AM +0000, Bertrand Drouvot wrote:
> That sounds a good idea to measure the impact of those extra calls and see
> if we'd need to mitigate the impacts. I'll collect some data.

Thanks.
--
Michael

Attachment

signature.asc

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

03 March, 14:54:39

Hi,

On Mon, Mar 03, 2025 at 10:51:19AM +0900, Michael Paquier wrote:
> On Fri, Feb 28, 2025 at 10:39:31AM +0000, Bertrand Drouvot wrote:
> > That sounds a good idea to measure the impact of those extra calls and see
> > if we'd need to mitigate the impacts. I'll collect some data.

So I did some tests using only one walsender (given the fact that the extra
lock you mentioned above is "only" for this particular backend).

=== Test with pg_receivewal

I was using one pg_receivewal process and did some tests that way:

pgbench -n -c8 -j8 -T60 -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 1));";)

I did not measure any noticeable extra lag (I did measure the time it took
for pg_size_pretty(sent_lsn - write_lsn) from pg_stat_replication to be back
to zero).

During the pgbench run a "perf record --call-graph fp -p <walsender_pid>" would
report (perf report -n):

1. pgstat_flush_backend() appears at about 3%
2. pg_memory_is_all_zeros() at about 2.8%
3. pgstat_flush_io() at about 0.4%

So it does not look like what we're adding here can be seen as a primary bottleneck.

That said it looks like that there is room for improvment in pgstat_flush_backend()
and that relying on a "have_iostats" like variable would be better than those
pg_memory_is_all_zeros() calls.

That's done in 0001 attached, by doing so, pgstat_flush_backend() now appears at
about 0.2%.

=== Test with pg_recvlogical

Now it does not look like pg_receivewal had a lot of IO stats to report (looking at
pg_stat_get_backend_io() output for the walsender).

Doing the same test with "pg_recvlogical -d postgres -S logical_slot -f /dev/null  --start"
reports much more IO stats.

What I observe without the "have_iostats" optimization is:

1. I did not measure any noticeable extra lag
2. pgstat_flush_io() at about 5.5% (pgstat_io_flush_cb() at about 5.3%)
3  pgstat_flush_backend() at about 4.8%

and with the "have_iostats" optimization I now see pgstat_flush_backend() at
about 2.51%.

So it does not look like what we're adding here can be seen as a primary bottleneck
but that is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given the fact that 
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
that could still make sense to reduce the frequency of the flush calls, thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

10 March, 03:23:50

On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:
> So it does not look like what we're adding here can be seen as a primary bottleneck
> but that is probably worth implementing the "have_iostats" optimization attached.
>
> Also, while I did not measure any noticeable extra lag, given the fact that
> pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
> that could still make sense to reduce the frequency of the flush calls, thoughts?

Okay.  The number of reports really worry me anyway as proposed in the
patch.  It's not really complicated to see dozens of pgstats calls in
a single millisecond in the WAL sender, even with some of the
regression tests.  That's more present in the logirep scenarios, it
seems..

For example, here are some numbers based on some elog() calls planted
in walreceiver.c, looking at the logs of the TAP tests.  Through a
single check-world, I am able to gather the following information:
- The location proposed by the patch leads to 680k stats reports,
in WalSndLoop() because the "Block if we have unsent data".  Some
tests show quite some contention.
- A different location, in WalSndLoop() just before WalSndWait() leads
to 59k calls.  And there's some contention in the logirep tests..
- I've spotted a third candidate which looks pretty solid, actually:
WalSndWaitForWal() before WalSndWait().  This leads to 2.9k reports in
the whole test suite, with much less contention in the reports.  These
can still be rather frequent, up to ~50 calls per seconds, but that's
really less than the two others.

Stats data is useful as long as it is possible to get an idea of how
the system behaves, particularly with a steady workload.  More
frequent reports are useful for spikey data detection, showing more
noise.  Still, too many reports may cause the part gathering the
reports to become a bottleneck, while we want it to offer hints about
bottlenecks.  So I would argue in favor of a more conservative choice
in the back branches than what the patch is proposing.  Choice 3 i'm
quoting above is tempting by design: not too much, still frequent
enough to offer enough relevant information in the stats.
--
Michael

Attachment

signature.asc

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

10 March, 17:52:42

Hi,

On Mon, Mar 10, 2025 at 09:23:50AM +0900, Michael Paquier wrote:
> On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:
> > So it does not look like what we're adding here can be seen as a primary bottleneck
> > but that is probably worth implementing the "have_iostats" optimization attached.
> > 
> > Also, while I did not measure any noticeable extra lag, given the fact that 
> > pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
> > that could still make sense to reduce the frequency of the flush calls, thoughts?
> 
> - I've spotted a third candidate which looks pretty solid, actually:
> WalSndWaitForWal() before WalSndWait().  This leads to 2.9k reports in
> the whole test suite, with much less contention in the reports.  These
> can still be rather frequent, up to ~50 calls per seconds, but that's
> really less than the two others.
> 
> Stats data is useful as long as it is possible to get an idea of how
> the system behaves, particularly with a steady workload.  More
> frequent reports are useful for spikey data detection, showing more
> noise.  Still, too many reports may cause the part gathering the
> reports to become a bottleneck, while we want it to offer hints about
> bottlenecks.  So I would argue in favor of a more conservative choice
> in the back branches than what the patch is proposing.

Yeah, fully agree. Anyway, having so many frequent stats reports makes little sense.

> Choice 3 i'm
> quoting above is tempting by design: not too much, still frequent
> enough to offer enough relevant information in the stats.

Yeah, I also agree that we should reduce the number of "reports". I'll look
at the third option.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

11 March, 09:10:32

Hi,

On Mon, Mar 10, 2025 at 02:52:42PM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Mar 10, 2025 at 09:23:50AM +0900, Michael Paquier wrote:
> > On Mon, Mar 03, 2025 at 11:54:39AM +0000, Bertrand Drouvot wrote:
> > > So it does not look like what we're adding here can be seen as a primary bottleneck
> > > but that is probably worth implementing the "have_iostats" optimization attached.
> > > 
> > > Also, while I did not measure any noticeable extra lag, given the fact that 
> > > pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
> > > that could still make sense to reduce the frequency of the flush calls, thoughts?
> > 
> > - I've spotted a third candidate which looks pretty solid, actually:
> > WalSndWaitForWal() before WalSndWait().  This leads to 2.9k reports in
> > the whole test suite, with much less contention in the reports.  These
> > can still be rather frequent, up to ~50 calls per seconds, but that's
> > really less than the two others.
> > 
> > Stats data is useful as long as it is possible to get an idea of how
> > the system behaves, particularly with a steady workload.  More
> > frequent reports are useful for spikey data detection, showing more
> > noise.  Still, too many reports may cause the part gathering the
> > reports to become a bottleneck, while we want it to offer hints about
> > bottlenecks.  So I would argue in favor of a more conservative choice
> > in the back branches than what the patch is proposing.
> 
> Yeah, fully agree. Anyway, having so many frequent stats reports makes little sense.
> 
> > Choice 3 i'm
> > quoting above is tempting by design: not too much, still frequent
> > enough to offer enough relevant information in the stats.
> 
> Yeah, I also agree that we should reduce the number of "reports". I'll look
> at the third option.

WalSndWaitForWal() is being used only for logical walsender. So we'd need to 
find another location for the physical walsender case. One option is to keep the
WalSndLoop() location and control the reports frequency.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

13 March, 12:31:29

Hi,

On Tue, Mar 11, 2025 at 06:10:32AM +0000, Bertrand Drouvot wrote:
> WalSndWaitForWal() is being used only for logical walsender. So we'd need to 
> find another location for the physical walsender case. One option is to keep the
> WalSndLoop() location and control the reports frequency.

I ended up with an idea in the same vein as your third one, but let's do it
in WalSndLoop() instead for both logical and physical walsenders. The idea
is to report the stats when the walsender is caught up or has pending data to
send. Also, we do flush only periodically (based on PGSTAT_MIN_INTERVAL).

Remarks/Questions:

1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both case
PGSTAT_MIN_INTERVAL or PGSTAT_MIN_INTERVAL, I'm not sure there is a need to 
update the related doc.

2. bonus point to make it here is that we don't need an extra call to
GetCurrentTimestamp(). We simply assign the one that is already done to "now".

3. add to use poll_query_until() has I've seen failures on the CI whith this
new approach. Also moving the test far away from the stats reset to to minimize
the risk of polling for too long.

With the attached in place, the number of times the stats are being flushed
reduce drastically but still seems enough to get an accurate idea of the walsender
IO activity. Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v4-0001-Flush-the-IO-statistics-of-active-walsenders.patch

Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Xuneng Zhou

Date:

13 March, 14:33:24

Forgot to cc...

---------- Forwarded message ---------
发件人： Xuneng Zhou <xunengzhou@gmail.com>
Date: 2025年3月13日周四 19:15
Subject: Re: [BUG]: the walsender does not update its IO statistics until it exits
To: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>

Hi,

Thanks for working on this! I'm glad to see that the patch (https://www.postgresql.org/message-id/flat/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal) has been committed.

Regarding patch 0001, the optimization in pgstat_backend_have_pending_cb looks good:

bool
pgstat_backend_have_pending_cb(void)
{
- return (!pg_memory_is_all_zeros(&PendingBackendStats,
- sizeof(struct PgStat_BackendPending)));
+ return backend_has_iostats;
}

Additionally, the function pgstat_flush_backend includes the check:

+ if (!pgstat_backend_have_pending_cb())    return false;

However, I think we might need to revise the comment (and possibly the function name) for clarity:

/* * Check if there are any backend stats waiting to be flushed. */

Originally, this function was intended to check multiple types of backend statistics, which made sense when PendingBackendStats was the centralized structure for various pending backend stats. However, since PgStat_PendingWalStats was removed from PendingBackendStats earlier, and now this patch introduces the backend_has_iostats variable, the scope of this function appears even narrower. This narrowed functionality no longer aligns closely with the original function name and its associated comment.

Bertrand Drouvot <bertranddrouvot.pg@gmail.com> 于2025年3月3日周一 19:54写道：

Hi,

On Mon, Mar 03, 2025 at 10:51:19AM +0900, Michael Paquier wrote:
> On Fri, Feb 28, 2025 at 10:39:31AM +0000, Bertrand Drouvot wrote:
> > That sounds a good idea to measure the impact of those extra calls and see
> > if we'd need to mitigate the impacts. I'll collect some data.

So I did some tests using only one walsender (given the fact that the extra
lock you mentioned above is "only" for this particular backend).

=== Test with pg_receivewal

I was using one pg_receivewal process and did some tests that way:

pgbench -n -c8 -j8 -T60 -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 1));";)

I did not measure any noticeable extra lag (I did measure the time it took
for pg_size_pretty(sent_lsn - write_lsn) from pg_stat_replication to be back
to zero).

During the pgbench run a "perf record --call-graph fp -p <walsender_pid>" would
report (perf report -n):

1. pgstat_flush_backend() appears at about 3%
2. pg_memory_is_all_zeros() at about 2.8%
3. pgstat_flush_io() at about 0.4%

So it does not look like what we're adding here can be seen as a primary bottleneck.

That said it looks like that there is room for improvment in pgstat_flush_backend()
and that relying on a "have_iostats" like variable would be better than those
pg_memory_is_all_zeros() calls.

That's done in 0001 attached, by doing so, pgstat_flush_backend() now appears at
about 0.2%.

=== Test with pg_recvlogical

Now it does not look like pg_receivewal had a lot of IO stats to report (looking at
pg_stat_get_backend_io() output for the walsender).

Doing the same test with "pg_recvlogical -d postgres -S logical_slot -f /dev/null --start"
reports much more IO stats.

What I observe without the "have_iostats" optimization is:

1. I did not measure any noticeable extra lag
2. pgstat_flush_io() at about 5.5% (pgstat_io_flush_cb() at about 5.3%)
3 pgstat_flush_backend() at about 4.8%

and with the "have_iostats" optimization I now see pgstat_flush_backend() at
about 2.51%.

So it does not look like what we're adding here can be seen as a primary bottleneck
but that is probably worth implementing the "have_iostats" optimization attached.

Also, while I did not measure any noticeable extra lag, given the fact that
pgstat_flush_io() shows at about 5.5% and pgstat_flush_backend() at about 2.5%,
that could still make sense to reduce the frequency of the flush calls, thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Xuneng Zhou

Date:

13 March, 14:35:24

Sorry, forgot to cc this one too.

---------- Forwarded message ---------
发件人： Xuneng Zhou <xunengzhou@gmail.com>
Date: 2025年3月13日周四 19:29
Subject: Re: [BUG]: the walsender does not update its IO statistics until it exits
To: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>

Hi,
Given that PGSTAT_MIN_INTERVAL is the go-to setting for stats flushes
* Unless called with 'force', pending stats updates are flushed happen once
* per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
* block on lock acquisition, except if stats updates have been pending for
* longer than PGSTAT_MAX_INTERVAL (60000ms).
*
* Whenever pending stats updates remain at the end of pgstat_report_stat() a
* suggested idle timeout is returned. Currently this is always
* PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
* a timeout after which to call pgstat_report_stat(true), but are not
* required to do so.

I am curious about the reason for this:

1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both case
PGSTAT_MIN_INTERVAL or PGSTAT_MIN_INTERVAL, I'm not sure there is a need to
update the related doc.

PGSTAT_IDLE_INTERVAL seems to reduce the frequency even more.

With the attached in place, the number of times the stats are being flushed
reduce drastically but still seems enough to get an accurate idea of the walsender
IO activity. Thoughts?

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

13 March, 16:18:45

Hi,

On Thu, Mar 13, 2025 at 07:33:24PM +0800, Xuneng Zhou wrote:
> Regarding patch 0001, the optimization in pgstat_backend_have_pending_cb
> looks good:

Thanks for looking at it!

> bool
> pgstat_backend_have_pending_cb(void)
> {
> - return (!pg_memory_is_all_zeros(&PendingBackendStats,
> - sizeof(struct PgStat_BackendPending)));
> + return backend_has_iostats;
> }
> 
> Additionally, the function pgstat_flush_backend includes the check:
> 
> + if (!pgstat_backend_have_pending_cb())
>     return false;
> 
> However, I think we might need to revise the comment (and possibly the
> function name) for clarity:
> 
> /*
>  * Check if there are any backend stats waiting to be flushed.
>  */

The comment is not exactly this one on the current HEAD, it looks like that you're
looking at a previous version of the core code.

> Originally, this function was intended to check multiple types of backend
> statistics, which made sense when PendingBackendStats was the centralized
> structure for various pending backend stats. However, since
> PgStat_PendingWalStats was removed from PendingBackendStats earlier, and
> now this patch introduces the backend_has_iostats variable, the scope of
> this function appears even narrower. This narrowed functionality no longer
> aligns closely with the original function name and its associated comment.

I don't think so, as since 76def4cdd7c, a pgstat_backend_wal_have_pending() check
has been added to pgstat_backend_have_pending_cb(). You're probably looking at
a version prior to 76def4cdd7c.

This particular sub-patch needs a rebase though, done in the attached. 0001
remains unchanged as compared to the v4 one just shared up-thread. If 0001 goes
in, merging 0002 would be less beneficial (as compared to v3).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

13 March, 16:19:51

Hi,

On Thu, Mar 13, 2025 at 07:35:24PM +0800, Xuneng Zhou wrote:
> Hi,
> Given that PGSTAT_MIN_INTERVAL is the go-to setting for stats flushes
>  * Unless called with 'force', pending stats updates are flushed happen once
>  * per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
>  * block on lock acquisition, except if stats updates have been pending for
>  * longer than PGSTAT_MAX_INTERVAL (60000ms).
>  *
>  * Whenever pending stats updates remain at the end of pgstat_report_stat()
> a
>  * suggested idle timeout is returned. Currently this is always
>  * PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set
> up
>  * a timeout after which to call pgstat_report_stat(true), but are not
>  * required to do so.
> 
>  I am curious about the reason for this:

Thanks for looking at it!

> > 1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both
> > case
> > PGSTAT_MIN_INTERVAL or PGSTAT_MIN_INTERVAL, I'm not sure there is a need to
> > update the related doc.
> >
> >
> PGSTAT_IDLE_INTERVAL seems to reduce the frequency even more.

Yeah, I think that PGSTAT_MIN_INTERVAL is the one to use (that's why that's the 
one the patch is using). I just mentioned PGSTAT_IDLE_INTERVAL as an open door
for conversation in case one prefers a "larger" frequency. 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

18 March, 11:11:02

On Thu, Mar 13, 2025 at 01:18:45PM +0000, Bertrand Drouvot wrote:
> This particular sub-patch needs a rebase though, done in the attached. 0001
> remains unchanged as compared to the v4 one just shared up-thread. If 0001 goes
> in, merging 0002 would be less beneficial (as compared to v3).

PgStat_PendingIO is already quite large, could get larger, and the
backend stats would be a natural victim of that.  What you are
proposing in v4-0002 is beneficial in any case outside of the specific
problem of this thread with WAL senders.  It is new code on HEAD and
pgstat_flush_backend_entry_io() being the only entry point where IO
data is appended makes it easy to reason about.  Just to say that I am
planning to apply this part on HEAD.
--
Michael

Attachment

signature.asc

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

18 March, 12:51:14

Hi,

On Tue, Mar 18, 2025 at 05:11:02PM +0900, Michael Paquier wrote:
> On Thu, Mar 13, 2025 at 01:18:45PM +0000, Bertrand Drouvot wrote:
> > This particular sub-patch needs a rebase though, done in the attached. 0001
> > remains unchanged as compared to the v4 one just shared up-thread. If 0001 goes
> > in, merging 0002 would be less beneficial (as compared to v3).
> 
> PgStat_PendingIO is already quite large, could get larger, and the
> backend stats would be a natural victim of that.  What you are
> proposing in v4-0002 is beneficial in any case outside of the specific
> problem of this thread with WAL senders.

Yeah, I also think it makes sense independently of what has been discussed
in this thread.

> Just to say that I am planning to apply this part on HEAD.

Thanks!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Xuneng Zhou

Date:

18 March, 14:14:12

Hi,

I performed some tests using elog(no so sure whether this is the proper way to do it) to monitor the new method. Here are the findings:

• With PGSTAT_MIN_INTERVAL set to 1000ms, the number of flush reports was reduced to approximately 40–50 during the installcheck test suite.

• With PGSTAT_IDLE_INTERVAL set to 10000ms, the reports dropped to fewer than 5.

• In contrast, the previous approach—flushing after every WalSndKeepaliveIfNecessary()—resulted in roughly 50,000 flushes.

This reduction is significant, so the overhead from the flush reports is no longer a concern. However, we still need to determine whether this frequency is sufficient to capture the system’s state during periods of high WAL activity. Based on my tests, using PGSTAT_MIN_INTERVAL seems to provide a better balance than PGSTAT_IDLE_INTERVAL.

> > 1. maybe relying on PGSTAT_IDLE_INTERVAL would make more sense? In both
> > case
> > PGSTAT_MIN_INTERVAL or PGSTAT_MIN_INTERVAL, I'm not sure there is a need to
> > update the related doc.
> >
> >
> PGSTAT_IDLE_INTERVAL seems to reduce the frequency even more.

Yeah, I think that PGSTAT_MIN_INTERVAL is the one to use (that's why that's the
one the patch is using). I just mentioned PGSTAT_IDLE_INTERVAL as an open door
for conversation in case one prefers a "larger" frequency.

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

18 March, 17:45:15

Hi,

On Tue, Mar 18, 2025 at 07:14:12PM +0800, Xuneng Zhou wrote:
> Hi,
> 
> I performed some tests using elog

Thanks for the testing!

> (no so sure whether this is the proper way
> to do it) to monitor the new method.

Well that simple enough and that works well if the goal is just to "count" ;-)

> Here are the findings:
> 
> • With PGSTAT_MIN_INTERVAL set to 1000ms, the number of flush reports was
> reduced to approximately 40–50 during the installcheck test suite.
> 
> • With PGSTAT_IDLE_INTERVAL set to 10000ms, the reports dropped to fewer
> than 5.
> 
> • In contrast, the previous approach—flushing after every
> WalSndKeepaliveIfNecessary()—resulted in roughly 50,000 flushes.
> 
> This reduction is significant, so the overhead from the flush reports is no
> longer a concern.

Yeah I did observe and do think the same.

> However, we still need to determine whether this
> frequency is sufficient to capture the system’s state during periods of
> high WAL activity.

I think that reporting at PGSTAT_MIN_INTERVAL is fine and more than enough. I
mean, I 'm not sure that there is a real use case to query the statistics related
view at more than a second interval anyway. Or are you concerned that we may not
enter the "When the WAL sender is caught up or has pending data to send" frequently
enough?

> Based on my tests, using PGSTAT_MIN_INTERVAL seems to
> provide a better balance than PGSTAT_IDLE_INTERVAL.

Same here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

19 March, 02:19:42

On Tue, Mar 18, 2025 at 09:51:14AM +0000, Bertrand Drouvot wrote:
> Thanks!

Done that part for now.  It will be possible to look at the bug fix
after the release freeze as it impacts stable branches as well.
--
Michael

Attachment

signature.asc

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Xuneng Zhou

Date:

19 March, 04:53:16

Hi,

I am OK with PGSTAT_MIN_INTERVAL. It has been used for flushing other statistics as well. And monitoring systems are generally configured to poll at one-second or longer intervals.

I think that reporting at PGSTAT_MIN_INTERVAL is fine and more than enough. I
mean, I 'm not sure that there is a real use case to query the statistics related
view at more than a second interval anyway.

I think these two conditions are good too. In a busy system, they are met frequently, so the flush routine will be executed at least once every second. Conversely, when WAL generation is low, there's simply less data to record, and the flush frequency naturally decreases.

Or are you concerned that we may not
enter the "When the WAL sender is caught up or has pending data to send" frequently
enough?

> Based on my tests, using PGSTAT_MIN_INTERVAL seems to
> provide a better balance than PGSTAT_IDLE_INTERVAL.

Same here.

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

19 March, 08:26:47

On Wed, Mar 19, 2025 at 09:53:16AM +0800, Xuneng Zhou wrote:
> I think these two conditions are good too. In a busy system, they are met
> frequently, so the flush routine will be executed at least once every
> second. Conversely, when WAL generation is low, there's simply less data to
> record, and the flush frequency naturally decreases.

Hmm, yeah, perhaps this is acceptable.  The changes in pgstat.c seem
inconsistent, though, only moving the min interval while the max and
idle times stay around.

This also make me wonder if we should work towards extending
pgstat_report_stat(), so as we save in GetCurrentTimestamp() while
making the internals still local to pgstat.c.  Or perhaps not in the
scope of a backpatchable design.
--
Michael

Attachment

signature.asc

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

19 March, 08:26:54

Hi,

On Wed, Mar 19, 2025 at 08:19:42AM +0900, Michael Paquier wrote:
> 
> Done that part for now.

Thanks!

> It will be possible to look at the bug fix
> after the release freeze as it impacts stable branches as well.

Yes, let's do that. Adding a reminder on my side.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

19 March, 08:50:28

Hi,

On Wed, Mar 19, 2025 at 02:26:47PM +0900, Michael Paquier wrote:
> On Wed, Mar 19, 2025 at 09:53:16AM +0800, Xuneng Zhou wrote:
> > I think these two conditions are good too. In a busy system, they are met
> > frequently, so the flush routine will be executed at least once every
> > second. Conversely, when WAL generation is low, there's simply less data to
> > record, and the flush frequency naturally decreases.
> 
> Hmm, yeah, perhaps this is acceptable.  The changes in pgstat.c seem
> inconsistent, though, only moving the min interval while the max and
> idle times stay around.

That's right. OTOH that sounds weird to move the others 2: that would create 
wider visibility for them without real needs. That's not a big issue,
but could impact extensions or friends that would start using those should we
change their values in the future. 

Another option could be to create a dedicated one in walsender.c but I'm not
sure I like it more.

> This also make me wonder if we should work towards extending
> pgstat_report_stat(), so as we save in GetCurrentTimestamp() while
> making the internals still local to pgstat.c.  Or perhaps not in the
> scope of a backpatchable design.

I see. I think I'd vote for making the same changes on HEAD and the STABLE branches
as a first step. And then think about a new "design" that could be part of the
ideas suggested by Andres in [1]. Thoughts?

Regards,

[1]: https://www.postgresql.org/message-id/erpzwxoptqhuptdrtehqydzjapvroumkhh7lc6poclbhe7jk7l%40l3yfsq5q4pw7

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Xuneng Zhou

Date:

19 March, 11:00:49

Hi,
Moving the other two provides a more complete view of the settings. For newcomers(like me) to the codebase, seeing all three related values in one place helps avoid a narrow view of the settings.

But I am not sure that I understand the cons of this well.

Bertrand Drouvot <bertranddrouvot.pg@gmail.com> 于2025年3月19日周三 13:50写道：

Hi,

On Wed, Mar 19, 2025 at 02:26:47PM +0900, Michael Paquier wrote:
> On Wed, Mar 19, 2025 at 09:53:16AM +0800, Xuneng Zhou wrote:
> > I think these two conditions are good too. In a busy system, they are met
> > frequently, so the flush routine will be executed at least once every
> > second. Conversely, when WAL generation is low, there's simply less data to
> > record, and the flush frequency naturally decreases.
>
> Hmm, yeah, perhaps this is acceptable. The changes in pgstat.c seem
> inconsistent, though, only moving the min interval while the max and
> idle times stay around.

That's right. OTOH that sounds weird to move the others 2: that would create
wider visibility for them without real needs. That's not a big issue,
but could impact extensions or friends that would start using those should we
change their values in the future.

The current name of the min interval appears consistent within the context of the surrounding code in this patch.

Another option could be to create a dedicated one in walsender.c but I'm not
sure I like it more.

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

24 March, 02:41:20

On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote:
> Hi,
> Moving the other two provides a more complete view of the settings. For
> newcomers(like me) to the codebase, seeing all three related values in one
> place helps avoid a narrow view of the settings.
>
> But I am not sure that I understand the cons of this well.

While I don't disagree with the use of a hardcoded interval of time to
control timing the flush of the WAL sender stats, do we really need to
rely on the timing defined by pgstat.c?  Wouldn't it be simpler to
assign one in walsender.c and pick up a different, perhaps higher,
value?

At the end the timestamp calculations are free because we can rely on
the existing call of GetCurrentTimestamp() for the physical WAL
senders to get an idea of the current time.  For the logical WAL
senders, perhaps we'd better do the reports in WalSndWaitForWal(),
actually.  There is also a call to GetCurrentTimestamp() that we can
rely on in this path.
--
Michael

Attachment

signature.asc

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

31 March, 16:05:24

Hi,

On Mon, Mar 24, 2025 at 08:41:20AM +0900, Michael Paquier wrote:
> On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote:
> > Hi,
> > Moving the other two provides a more complete view of the settings. For
> > newcomers(like me) to the codebase, seeing all three related values in one
> > place helps avoid a narrow view of the settings.
> > 
> > But I am not sure that I understand the cons of this well.
> 
> While I don't disagree with the use of a hardcoded interval of time to
> control timing the flush of the WAL sender stats, do we really need to
> rely on the timing defined by pgstat.c?

No but I thought it could make sense.

> Wouldn't it be simpler to
> assign one in walsender.c and pick up a different, perhaps higher,
> value?

I don't have a strong opinion on it so done as suggested above in the attached.

I think that the 1s value is fine because: 1. it is consistent with 
PGSTAT_MIN_INTERVAL and 2. it already needs that the sender is caught up or
has pending data to send (means it could be higher than 1s already). That said,
I don't think that would hurt if you think of a higher value.

> At the end the timestamp calculations are free because we can rely on
> the existing call of GetCurrentTimestamp() for the physical WAL
> senders to get an idea of the current time.

Yup

> For the logical WAL
> senders, perhaps we'd better do the reports in WalSndWaitForWal(),
> actually.  There is also a call to GetCurrentTimestamp() that we can
> rely on in this path.

I think it's better to flush the stats in a shared code path. I think it's
easier to maintain and that there is no differences between logical and
physical walsenders that would justify to flush the stats in specific code
paths.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v5-0001-Flush-the-IO-statistics-of-active-walsenders.patch

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

vignesh C

Date:

03 April, 06:55:02

On Mon, 31 Mar 2025 at 18:35, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Mon, Mar 24, 2025 at 08:41:20AM +0900, Michael Paquier wrote:
> > On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote:
> > > Hi,
> > > Moving the other two provides a more complete view of the settings. For
> > > newcomers(like me) to the codebase, seeing all three related values in one
> > > place helps avoid a narrow view of the settings.
> > >
> > > But I am not sure that I understand the cons of this well.
> >
> > While I don't disagree with the use of a hardcoded interval of time to
> > control timing the flush of the WAL sender stats, do we really need to
> > rely on the timing defined by pgstat.c?
>
> No but I thought it could make sense.
>
> > Wouldn't it be simpler to
> > assign one in walsender.c and pick up a different, perhaps higher,
> > value?
>
> I don't have a strong opinion on it so done as suggested above in the attached.
>
> I think that the 1s value is fine because: 1. it is consistent with
> PGSTAT_MIN_INTERVAL and 2. it already needs that the sender is caught up or
> has pending data to send (means it could be higher than 1s already). That said,
> I don't think that would hurt if you think of a higher value.
>
> > At the end the timestamp calculations are free because we can rely on
> > the existing call of GetCurrentTimestamp() for the physical WAL
> > senders to get an idea of the current time.
>
> Yup
>
> > For the logical WAL
> > senders, perhaps we'd better do the reports in WalSndWaitForWal(),
> > actually.  There is also a call to GetCurrentTimestamp() that we can
> > rely on in this path.
>
> I think it's better to flush the stats in a shared code path. I think it's
> easier to maintain and that there is no differences between logical and
> physical walsenders that would justify to flush the stats in specific code
> paths.

Couple of suggestions:
1) I felt we can include a similar verification for one of the logical
replication tests too:
+# Wait for the walsender to update its IO statistics.

+# Has to be done before the next restart and far enough from the
+# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
+$node_primary->poll_query_until(
+ 'postgres',
+ qq[SELECT sum(reads) > 0
+       FROM pg_catalog.pg_stat_io
+       WHERE backend_type = 'walsender'
+       AND object = 'wal']
+  )
+  or die
+  "Timed out while waiting for the walsender to update its IO statistics";
+

2) We can comment this in a single line itself:
+ /*
+ * Report IO statistics
+ */

Regards,
Vignesh

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

03 April, 10:24:46

Hi,

On Thu, Apr 03, 2025 at 09:25:02AM +0530, vignesh C wrote:
> On Mon, 31 Mar 2025 at 18:35, Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> Couple of suggestions:

Thanks for looking at it!

> 1) I felt we can include a similar verification for one of the logical
> replication tests too:
> +# Wait for the walsender to update its IO statistics.
> 
> +# Has to be done before the next restart and far enough from the
> +# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
> +$node_primary->poll_query_until(
> + 'postgres',
> + qq[SELECT sum(reads) > 0
> +       FROM pg_catalog.pg_stat_io
> +       WHERE backend_type = 'walsender'
> +       AND object = 'wal']
> +  )
> +  or die
> +  "Timed out while waiting for the walsender to update its IO statistics";
> +

Initially ([1]) it was added in 035_standby_logical_decoding.pl. But this
test (035) is already racy enough that I felt better to move it to 001_stream_rep.pl
(see [2]) instead.

I don't think that's a big issue as the code path being changed in the patch is
shared between logical and physical walsenders. That said that would not hurt to
add a logical walsender test, but I would prefer it to be outside
035_standby_logical_decoding.pl. Do you have a suggestion for the location of
such a test?

> 2) We can comment this in a single line itself:
> + /*
> + * Report IO statistics
> + */

Right, but:

- it's already done that way in walsender.c (see "Try to flush any pending output to the client"
for example).
- the exact same comment is written that way in pgstat_bgwriter.c and
pgstat_checkpointer.c

So that I just copy/paste it.

[1]: https://www.postgresql.org/message-id/Z73IsKBceoVd4t55%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://www.postgresql.org/message-id/Z77jgvhwOu9S0a5r%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

vignesh C

Date:

03 April, 12:53:31

On Thu, 3 Apr 2025 at 12:54, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Thu, Apr 03, 2025 at 09:25:02AM +0530, vignesh C wrote:
> > On Mon, 31 Mar 2025 at 18:35, Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > >
> > Couple of suggestions:
>
> Thanks for looking at it!
>
> > 1) I felt we can include a similar verification for one of the logical
> > replication tests too:
> > +# Wait for the walsender to update its IO statistics.
> >
> > +# Has to be done before the next restart and far enough from the
> > +# pg_stat_reset_shared('io') to minimize the risk of polling for too long.
> > +$node_primary->poll_query_until(
> > + 'postgres',
> > + qq[SELECT sum(reads) > 0
> > +       FROM pg_catalog.pg_stat_io
> > +       WHERE backend_type = 'walsender'
> > +       AND object = 'wal']
> > +  )
> > +  or die
> > +  "Timed out while waiting for the walsender to update its IO statistics";
> > +
>
> Initially ([1]) it was added in 035_standby_logical_decoding.pl. But this
> test (035) is already racy enough that I felt better to move it to 001_stream_rep.pl
> (see [2]) instead.
>
> I don't think that's a big issue as the code path being changed in the patch is
> shared between logical and physical walsenders. That said that would not hurt to
> add a logical walsender test, but I would prefer it to be outside
> 035_standby_logical_decoding.pl. Do you have a suggestion for the location of
> such a test?

Can we add it to one of the subscription tests, such as 001_rep_changes.pl?

Regards,
Vignesh

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

04 April, 08:01:04

Hi,

On Thu, Apr 03, 2025 at 03:23:31PM +0530, vignesh C wrote:
> Can we add it to one of the subscription tests, such as 001_rep_changes.pl?

Yeah that sounds like a good place for it. Done in the attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

v6-0001-Flush-the-IO-statistics-of-active-walsenders.patch

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

vignesh C

Date:

04 April, 19:03:46

On Fri, 4 Apr 2025 at 10:31, Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Thu, Apr 03, 2025 at 03:23:31PM +0530, vignesh C wrote:
> > Can we add it to one of the subscription tests, such as 001_rep_changes.pl?
>
> Yeah that sounds like a good place for it. Done in the attached.

The new test added currently passes even without the patch. It would
be ideal to have a test that fails without the patch and passes once
the patch is applied.

Regards,
Vignesh

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

07 April, 09:31:14

On Fri, Apr 04, 2025 at 09:33:46PM +0530, vignesh C wrote:
> The new test added currently passes even without the patch. It would
> be ideal to have a test that fails without the patch and passes once
> the patch is applied.

Right.  The subscription test and logical WAL senders passes without
the patch, because we are able to catch some WAL activity through
pgoutput.  The recovery test for physical WAL sender fails without the
patch on timeout.

We could need something more advanced here for the logical case, where
we could use pg_recvlogical started in the background with a hardcoded
endpos, or just kill the pg_recvlogical command once we have checked
the state of the stats.  I am not sure if this is worth the cycles
spent on, TBH, so I would be happy with just the physical case checked
in TAP as it's simpler because streaming replication makes that easy
to work with.

One thing that I'm a bit unhappy about in the patch is the choice to
do the stats updates in WalSndLoop() for the logical WAL sender case,
because this forces an extra GetCurrentTimestamp() call for each loop,
and that's never a cheap system call in what can be a hot code path.
How about doing the calculations in WalSndWaitForWal() for the logical
part, relying on the existing GetCurrentTimestamp() done there?
That's also where the waits are handled for the logical part, so there
may be a good point in keeping this code more symmetric for now,
rather than split it.

Saying that, here is a version 7 with all that included, which is
simpler to read.
--
Michael

Attachment

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Bertrand Drouvot

Date:

07 April, 10:13:20

Hi,

On Mon, Apr 07, 2025 at 03:31:14PM +0900, Michael Paquier wrote:
> On Fri, Apr 04, 2025 at 09:33:46PM +0530, vignesh C wrote:
> > The new test added currently passes even without the patch. It would
> > be ideal to have a test that fails without the patch and passes once
> > the patch is applied.
> 
> Right.  The subscription test and logical WAL senders passes without
> the patch, because we are able to catch some WAL activity through
> pgoutput.  The recovery test for physical WAL sender fails without the
> patch on timeout.
> 
> We could need something more advanced here for the logical case, where
> we could use pg_recvlogical started in the background with a hardcoded
> endpos, or just kill the pg_recvlogical command once we have checked
> the state of the stats.  I am not sure if this is worth the cycles
> spent on,

Another option for the logical walsender is to reset the stats once the
subscription is created like in the attached. With the attached in place the
test now fails without the patch in place (and passes with the patch).

> One thing that I'm a bit unhappy about in the patch is the choice to
> do the stats updates in WalSndLoop() for the logical WAL sender case,
> because this forces an extra GetCurrentTimestamp() call for each loop,
> and that's never a cheap system call in what can be a hot code path.
> How about doing the calculations in WalSndWaitForWal() for the logical
> part, relying on the existing GetCurrentTimestamp() done there?
> That's also where the waits are handled for the logical part, so there
> may be a good point in keeping this code more symmetric for now,
> rather than split it.

Well, while I was working on a reproducer case for the logical walsender (that
lead to the attached), I realized that the test was looping for relatively long
time before passing. Then I was changing the patch to do *exactly* what you did
propose in v7 and the logical walsender test now passes quickly enough...

So it looks like that the idea of using a "shared" code path was not that good
and that spliting physical/logical as done in v7 is a good choice.

But...

> TBH, so I would be happy with just the physical case checked
> in TAP as it's simpler because streaming replication makes that easy
> to work with.

As we now have 2 code paths I think we "really" need 2 tests. The attached
(to apply on top of v7) seems to do the job.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

add_logical_walsender_test.txt

Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits

From

Michael Paquier

Date:

08 April, 02:00:48

On Mon, Apr 07, 2025 at 07:13:20AM +0000, Bertrand Drouvot wrote:
> As we now have 2 code paths I think we "really" need 2 tests. The attached
> (to apply on top of v7) seems to do the job.

Confirmed.  I am sold on this extra location on HEAD, and it does not
impact the run time of the test as there is enough activity happening
between the resets and the poll check.  So, done down to v16.
--
Michael

Attachment

signature.asc