Thread: Suppressing useless wakeups in walreceiver

Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

27 January 2022, 10:50:04

Hi,

While working on WaitEventSet-ifying various codepaths, I found it
strange that walreceiver wakes up 10 times per second while idle.
Here's a draft patch to compute the correct sleep time.

Attachment

0001-Suppress-useless-wakeups-in-walreceiver.patch

Re: Suppressing useless wakeups in walreceiver

From

Kyotaro Horiguchi

Date:

28 January 2022, 07:26:22

Hello.

At Thu, 27 Jan 2022 23:50:04 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in 
> While working on WaitEventSet-ifying various codepaths, I found it
> strange that walreceiver wakes up 10 times per second while idle.
> Here's a draft patch to compute the correct sleep time.

Agree to the objective.  However I feel the patch makes the code
somewhat less readable maybe because WalRcvComputeNextWakeup hides the
timeout deatils.  Of course other might thing differently.

-              ping_sent = false;
-              XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1,
-                         startpointTLI);
+              WalRcvComputeNextWakeup(&state,
+                          WALRCV_WAKEUP_TIMEOUT,
+                          last_recv_timestamp);
+              WalRcvComputeNextWakeup(&state,
+                          WALRCV_WAKEUP_PING,
+                          last_recv_timestamp);

The calculated target times are not used within the closest loop and
the loop is quite busy. WALRCV_WAKEUP_PING is the replacement of
ping_sent, but I think the computation of both two target times would
be better done after the loop only when the "if (len > 0)" block was
passed.

-          XLogWalRcvSendReply(false, false);
+          XLogWalRcvSendReply(&state, GetCurrentTimestamp(),
+                    false, false);

The GetCurrentTimestamp() is same with last_recv_timestamp when the
recent recv() had any bytes received. So we can avoid the call to
GetCurrentTimestamp in that case. If we do the change above, the same
flag notifies the necessity of separete GetCurrentTimestamp().

I understand the reason for startpointTLI being stored in WalRcvInfo
but don't understand about primary_has_standby_xmin. It is just moved
from a static variable of XLogWalRcvSendHSFeedback to the struct
member that is modifed and read only by the same function.

The enum item WALRCV_WAKEUP_TIMEOUT looks somewhat uninformative. How
about naming it WALRCV_WAKEUP_TERMIATE (or WALRCV_WAKEUP_TO_DIE)?

WALRCV_WAKEUP_TIMEOUT and WALRCV_WAKEUP_PING doesn't fire when
wal_receiver_timeout is zero.  In that case we should not set the
timeouts at all to avoid spurious wakeups?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

28 January 2022, 09:41:32

On Fri, Jan 28, 2022 at 8:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Thu, 27 Jan 2022 23:50:04 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> > While working on WaitEventSet-ifying various codepaths, I found it
> > strange that walreceiver wakes up 10 times per second while idle.
> > Here's a draft patch to compute the correct sleep time.
>
> Agree to the objective.  However I feel the patch makes the code
> somewhat less readable maybe because WalRcvComputeNextWakeup hides the
> timeout deatils.  Of course other might thing differently.

Thanks for looking!

The reason why I put the timeout computation into a function is
because there are about 3 places you need to do it: at the beginning,
when rescheduling for next time, and when the configuration file
changes and you might want to recompute them.  The logic to decode the
GUCs and compute the times would be duplicated.  I have added a
comment about that, and tried to make the code clearer.

Do you have another idea?

> -              ping_sent = false;
> -              XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1,
> -                         startpointTLI);
> +              WalRcvComputeNextWakeup(&state,
> +                          WALRCV_WAKEUP_TIMEOUT,
> +                          last_recv_timestamp);
> +              WalRcvComputeNextWakeup(&state,
> +                          WALRCV_WAKEUP_PING,
> +                          last_recv_timestamp);
>
> The calculated target times are not used within the closest loop and
> the loop is quite busy. WALRCV_WAKEUP_PING is the replacement of
> ping_sent, but I think the computation of both two target times would
> be better done after the loop only when the "if (len > 0)" block was
> passed.

Those two wakeup times should only be adjusted when data is received.
The calls should be exactly where last_recv_timestamp is set in
master, no?

> -          XLogWalRcvSendReply(false, false);
> +          XLogWalRcvSendReply(&state, GetCurrentTimestamp(),
> +                    false, false);
>
> The GetCurrentTimestamp() is same with last_recv_timestamp when the
> recent recv() had any bytes received. So we can avoid the call to
> GetCurrentTimestamp in that case. If we do the change above, the same
> flag notifies the necessity of separete GetCurrentTimestamp().

Yeah, I agree that we should remove more GetCurrentTimestamp() calls.
Here's another idea: let's remove last_recv_timestamp, and just use
'now', being careful to update it after sleeping and in the receive
loop (in case it gets busy for a long time), so that it's always fresh
enough.

> I understand the reason for startpointTLI being stored in WalRcvInfo
> but don't understand about primary_has_standby_xmin. It is just moved
> from a static variable of XLogWalRcvSendHSFeedback to the struct
> member that is modifed and read only by the same function.

Yeah, this was unnecessary refactoring.  I removed that change (we
could look into moving more state into the new state object instead of
using static locals and globals, but that can be for another thread, I
got carried away...)

> The enum item WALRCV_WAKEUP_TIMEOUT looks somewhat uninformative. How
> about naming it WALRCV_WAKEUP_TERMIATE (or WALRCV_WAKEUP_TO_DIE)?

Ok, let's try TERMINATE.

> WALRCV_WAKEUP_TIMEOUT and WALRCV_WAKEUP_PING doesn't fire when
> wal_receiver_timeout is zero.  In that case we should not set the
> timeouts at all to avoid spurious wakeups?

Oh, yeah, I forgot to handle wal_receiver_timeout = 0.  Fixed.

Attachment

v2-0001-Suppress-useless-wakeups-in-walreceiver.patch

Re: Suppressing useless wakeups in walreceiver

From

Kyotaro Horiguchi

Date:

31 January 2022, 07:28:11

At Fri, 28 Jan 2022 22:41:32 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in 
> On Fri, Jan 28, 2022 at 8:26 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Thu, 27 Jan 2022 23:50:04 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> The reason why I put the timeout computation into a function is
> because there are about 3 places you need to do it: at the beginning,
> when rescheduling for next time, and when the configuration file
> changes and you might want to recompute them.  The logic to decode the
> GUCs and compute the times would be duplicated.  I have added a
> comment about that, and tried to make the code clearer.
> 
> Do you have another idea?

Yeah, I understand that and am having no better ideas so I'm fine with
that.

> > -              ping_sent = false;
> > -              XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1,
> > -                         startpointTLI);
> > +              WalRcvComputeNextWakeup(&state,
> > +                          WALRCV_WAKEUP_TIMEOUT,
> > +                          last_recv_timestamp);
> > +              WalRcvComputeNextWakeup(&state,
> > +                          WALRCV_WAKEUP_PING,
> > +                          last_recv_timestamp);
> >
> > The calculated target times are not used within the closest loop and
> > the loop is quite busy. WALRCV_WAKEUP_PING is the replacement of
> > ping_sent, but I think the computation of both two target times would
> > be better done after the loop only when the "if (len > 0)" block was
> > passed.
> 
> Those two wakeup times should only be adjusted when data is received.

Yes.

> The calls should be exactly where last_recv_timestamp is set in
> master, no?

The calcuations at other than the last one iteration are waste of
cycles and I thought the loop as a kind of tight-loop.  So I don't
mind we consider the additional cycles are acceptable.  (Just I'm not
sure.)

> > -          XLogWalRcvSendReply(false, false);
> > +          XLogWalRcvSendReply(&state, GetCurrentTimestamp(),
> > +                    false, false);
> >
> > The GetCurrentTimestamp() is same with last_recv_timestamp when the
> > recent recv() had any bytes received. So we can avoid the call to
> > GetCurrentTimestamp in that case. If we do the change above, the same
> > flag notifies the necessity of separete GetCurrentTimestamp().
> 
> Yeah, I agree that we should remove more GetCurrentTimestamp() calls.
> Here's another idea: let's remove last_recv_timestamp, and just use
> 'now', being careful to update it after sleeping and in the receive
> loop (in case it gets busy for a long time), so that it's always fresh
> enough.

I like it.  It could be slightly off but it doesn't matter at all.

> > I understand the reason for startpointTLI being stored in WalRcvInfo
> > but don't understand about primary_has_standby_xmin. It is just moved
> > from a static variable of XLogWalRcvSendHSFeedback to the struct
> > member that is modifed and read only by the same function.
> 
> Yeah, this was unnecessary refactoring.  I removed that change (we
> could look into moving more state into the new state object instead of
> using static locals and globals, but that can be for another thread, I
> got carried away...)
> 
> > The enum item WALRCV_WAKEUP_TIMEOUT looks somewhat uninformative. How
> > about naming it WALRCV_WAKEUP_TERMIATE (or WALRCV_WAKEUP_TO_DIE)?
> 
> Ok, let's try TERMINATE.

Thanks:)

> > WALRCV_WAKEUP_TIMEOUT and WALRCV_WAKEUP_PING doesn't fire when
> > wal_receiver_timeout is zero.  In that case we should not set the
> > timeouts at all to avoid spurious wakeups?
> 
> Oh, yeah, I forgot to handle wal_receiver_timeout = 0.  Fixed.

Yeah, looks fine.

+static void
+WalRcvComputeNextWakeup(WalRcvInfo *state,
+                        WalRcvWakeupReason reason,
+                        TimestampTz now)
+{
+    switch (reason)
+    {
...
+    default:
+        break;
+    }
+}

Mmm.. there's NUM_WALRCV_WAKEUPS.. But I think we don't need to allow
the case.  Isn't it better to replace the break with Assert()?


It might be suitable for another patch, but we can do that
together. If we passed "now" to the function XLogWalRcvProcessMsg, we
would be able to remove further several calls to
GetCurrentTimestamp(), in the function itself and
ProcessWalSndrMessage (the name is improperly abbreviated..).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

30 September 2022, 03:51:41

I was surprised to see that this patch was never committed!

On Mon, Jan 31, 2022 at 04:28:11PM +0900, Kyotaro Horiguchi wrote:
> +static void
> +WalRcvComputeNextWakeup(WalRcvInfo *state,
> +                        WalRcvWakeupReason reason,
> +                        TimestampTz now)
> +{
> +    switch (reason)
> +    {
> ...
> +    default:
> +        break;
> +    }
> +}
> 
> Mmm.. there's NUM_WALRCV_WAKEUPS.. But I think we don't need to allow
> the case.  Isn't it better to replace the break with Assert()?

+1

> It might be suitable for another patch, but we can do that
> together. If we passed "now" to the function XLogWalRcvProcessMsg, we
> would be able to remove further several calls to
> GetCurrentTimestamp(), in the function itself and
> ProcessWalSndrMessage (the name is improperly abbreviated..).

While reading the patch, I did find myself wondering if there were some
additional opportunities for reducing the calls to GetCurrentTimestamp().
That was really the only thing that stood out to me.

Thomas, are you planning to continue with this patch?  If not, I don't mind
picking it up.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

30 September 2022, 04:04:43

On Fri, Sep 30, 2022 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Thomas, are you planning to continue with this patch?  If not, I don't mind
> picking it up.

Hi Nathan,

Please go ahead!  One thing I had in my notes to check with this patch
is whether there might be a bug due to computing all times in
microseconds, but sleeping for a number of milliseconds without
rounding up (what I mean is that if it's trying to sleep for 900
microseconds, it might finish up sleeping for 0 milliseconds in a
tight loop a few times).

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

30 September 2022, 20:41:47

On Fri, Sep 30, 2022 at 05:04:43PM +1300, Thomas Munro wrote:
> Please go ahead!  One thing I had in my notes to check with this patch
> is whether there might be a bug due to computing all times in
> microseconds, but sleeping for a number of milliseconds without
> rounding up (what I mean is that if it's trying to sleep for 900
> microseconds, it might finish up sleeping for 0 milliseconds in a
> tight loop a few times).

I think that's right.  If 'next_wakeup - now' is less than 1000, 'nap' will
be 0.  My first instinct is to simply round to the nearest millisecond, but
this might still result in wakeups before the calculated wakeup time, and I
think we ultimately want to nap until >= the wakeup time whenever possible
to avoid tight loops.  So, I agree that it would be better to round up to
the next millisecond whenever nextWakeup > now.  Given the current behavior
is to wake up every 100 milliseconds, an extra millisecond here and there
doesn't seem like it should cause any problems.  And I suppose we should
assume we might nap longer already, anyway.

I do see that the wakeup time for WALRCV_WAKEUP_PING might be set to 'now'
whenever wal_receiver_timeout is 1 millisecond, but I think that's existing
behavior, and it feels like an extreme case that we probably needn't worry
about too much.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

04 October 2022, 17:57:55

Here is an updated patch set with the following changes:

* The creation of the struct for non-shared WAL receiver state is moved to
a prerequisite 0001 patch.  This should help ease review of 0002 a bit.

* I updated the nap time calculation to round up to the next millisecond,
as discussed upthread.

* I attempted to minimize the calls to GetCurrentTimestamp().  The WAL
receiver code already calls this function pretty liberally, so I don't know
if this is important, but perhaps it could make a difference for systems
that don't have something like the vDSO to avoid real system calls.

* I removed the 'tli' argument from functions that now have an argument for
the non-shared state struct.  The 'tli' is stored within the struct, so the
extra argument is unnecessary.

One thing that still bugs me a little bit about 0002 is that the calls to
GetCurrentTimestamp() feel a bit scattered and more difficult to reason
about.  AFAICT 0002 keeps 'now' relatively up-to-date, but it seems
possible that a future change could easily disrupt that.  I don't have any
other ideas at the moment, though.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Suppressing useless wakeups in walreceiver

From

Alvaro Herrera

Date:

05 October 2022, 07:57:00

On 2022-Oct-04, Nathan Bossart wrote:

> Here is an updated patch set with the following changes:
> 
> * The creation of the struct for non-shared WAL receiver state is moved to
> a prerequisite 0001 patch.  This should help ease review of 0002 a bit.

I think that would be even better if you moved the API adjustments of
some functions for the new struct to 0001 as well
(XLogWalRcvSendHSFeedback etc).

> * I updated the nap time calculation to round up to the next millisecond,
> as discussed upthread.

I didn't look at this part very carefully, but IIRC walreceiver's
responsivity has a direct impact on latency for sync replicas.  Maybe
it'd be sensible to the definition that the sleep time is rounded down
rather than up?  (At least, for the cases where we have something to do
and not merely continue sleeping.) 

> * I attempted to minimize the calls to GetCurrentTimestamp().  The WAL
> receiver code already calls this function pretty liberally, so I don't know
> if this is important, but perhaps it could make a difference for systems
> that don't have something like the vDSO to avoid real system calls.

Yeah, we are indeed careful about this in most places.  Maybe it's not
particularly important in 2022 and also not particularly important for
walreceiver (again, except in the code paths that affect sync replicas),
but there's no reason not to continue to be careful until we discover
more widely that it doesn't matter.

> * I removed the 'tli' argument from functions that now have an argument for
> the non-shared state struct.  The 'tli' is stored within the struct, so the
> extra argument is unnecessary.

+1  (This could also be in 0001, naturally.)

> One thing that still bugs me a little bit about 0002 is that the calls to
> GetCurrentTimestamp() feel a bit scattered and more difficult to reason
> about.  AFAICT 0002 keeps 'now' relatively up-to-date, but it seems
> possible that a future change could easily disrupt that.  I don't have any
> other ideas at the moment, though.

Hmm.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

05 October 2022, 18:08:09

Thanks for taking a look.

On Wed, Oct 05, 2022 at 09:57:00AM +0200, Alvaro Herrera wrote:
> On 2022-Oct-04, Nathan Bossart wrote:
>> * The creation of the struct for non-shared WAL receiver state is moved to
>> a prerequisite 0001 patch.  This should help ease review of 0002 a bit.
> 
> I think that would be even better if you moved the API adjustments of
> some functions for the new struct to 0001 as well
> (XLogWalRcvSendHSFeedback etc).

I moved as much as I could to 0001 in v4.

>> * I updated the nap time calculation to round up to the next millisecond,
>> as discussed upthread.
> 
> I didn't look at this part very carefully, but IIRC walreceiver's
> responsivity has a direct impact on latency for sync replicas.  Maybe
> it'd be sensible to the definition that the sleep time is rounded down
> rather than up?  (At least, for the cases where we have something to do
> and not merely continue sleeping.) 

The concern is that if we wake up early, we'll end up spinning until the
wake-up time is reached.  Given the current behavior is to sleep for 100ms
at a time, and the tasks in question are governed by wal_receiver_timeout
and wal_receiver_status_interval (which are typically set to several
seconds) it seems reasonably safe to sleep up to an extra ~1ms here and
there without sacrificing responsiveness.  In fact, I imagine this patch
results in a net improvement in responsiveness for these tasks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Suppressing useless wakeups in walreceiver

From

Bharath Rupireddy

Date:

10 October 2022, 09:52:15

On Wed, Oct 5, 2022 at 11:38 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> I moved as much as I could to 0001 in v4.

Some comments on v4-0002:

1. You might want to use PG_INT64_MAX instead of INT64_MAX for portability?

2. With the below change, the time walreceiver spends in
XLogWalRcvSendReply() is also included for XLogWalRcvSendHSFeedback
right? I think it's a problem given that XLogWalRcvSendReply() can
take a while. Earlier, this wasn't the case, each function calculating
'now' separately. Any reason for changing this behaviour? I know that
GetCurrentTimestamp(); isn't cheaper all the time, but here it might
result in a change in the behaviour.
-            XLogWalRcvSendReply(false, false);
-            XLogWalRcvSendHSFeedback(false);
+            TimestampTz now = GetCurrentTimestamp();
+
+            XLogWalRcvSendReply(state, now, false, false);
+            XLogWalRcvSendHSFeedback(state, now, false);

3. I understand that TimestampTz type is treated as microseconds.
Would you mind having a comment in below places to say why we're
multiplying with 1000 or 1000000 here? Also, do we need 1000L or
1000000L or type cast to
TimestampTz?
+            state->wakeup[reason] = now + (wal_receiver_timeout / 2) * 1000;
+            state->wakeup[reason] = now + wal_receiver_timeout * 1000;
+            state->wakeup[reason] = now +
wal_receiver_status_interval * 1000000;

4. How about simplifying WalRcvComputeNextWakeup() something like below?
static void
WalRcvComputeNextWakeup(WalRcvInfo *state, WalRcvWakeupReason reason,
                        TimestampTz now)
{
    TimestampTz ts = INT64_MAX;

    switch (reason)
    {
    case WALRCV_WAKEUP_TERMINATE:
        if (wal_receiver_timeout > 0)
            ts = now + (TimestampTz) (wal_receiver_timeout * 1000L);
        break;
    case WALRCV_WAKEUP_PING:
        if (wal_receiver_timeout > 0)
            ts = now + (TimestampTz) ((wal_receiver_timeout / 2) * 1000L);
        break;
    case WALRCV_WAKEUP_HSFEEDBACK:
        if (hot_standby_feedback && wal_receiver_status_interval > 0)
            ts = now + (TimestampTz) (wal_receiver_status_interval * 1000000L);
        break;
    case WALRCV_WAKEUP_REPLY:
        if (wal_receiver_status_interval > 0)
            ts = now + (TimestampTz) (wal_receiver_status_interval * 1000000);
        break;
    default:
        /* An error is better here. */
    }

    state->wakeup[reason] = ts;
}

5. Can we move below code snippets to respective static functions for
better readability and code reuse?
This:
+                /* Find the soonest wakeup time, to limit our nap. */
+                nextWakeup = INT64_MAX;
+                for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+                    nextWakeup = Min(state.wakeup[i], nextWakeup);
+                nap = Max(0, (nextWakeup - now + 999) / 1000);

And this:

+                    now = GetCurrentTimestamp();
+                    for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+                        WalRcvComputeNextWakeup(&state, i, now);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

10 October 2022, 17:51:14

On Mon, Oct 10, 2022 at 03:22:15PM +0530, Bharath Rupireddy wrote:
> Some comments on v4-0002:

Thanks for taking a look.

> 1. You might want to use PG_INT64_MAX instead of INT64_MAX for portability?

Yes, I used PG_INT64_MAX in v5.

> 2. With the below change, the time walreceiver spends in
> XLogWalRcvSendReply() is also included for XLogWalRcvSendHSFeedback
> right? I think it's a problem given that XLogWalRcvSendReply() can
> take a while. Earlier, this wasn't the case, each function calculating
> 'now' separately. Any reason for changing this behaviour? I know that
> GetCurrentTimestamp(); isn't cheaper all the time, but here it might
> result in a change in the behaviour.

Yes, if XLogWalRcvSendReply() takes a long time, we might defer sending the
hot_standby_feedback message until a later time.  The only reason I changed
this was to avoid extra calls to GetCurrentTimestamp(), which might be
expensive on some platforms.  Outside of the code snippet you pointed out,
I think WalReceiverMain() has a similar problem.  That being said, I'm
generally skeptical that this sort of thing is detrimental given the
current behavior (i.e., wake up every 100ms), the usual values of these
GUCs (i.e., tens of seconds), and the fact that any tasks that are
inappropriately skipped will typically be retried in the next iteration of
the loop.

> 3. I understand that TimestampTz type is treated as microseconds.
> Would you mind having a comment in below places to say why we're
> multiplying with 1000 or 1000000 here? Also, do we need 1000L or
> 1000000L or type cast to
> TimestampTz?

I used INT64CONST() in v5, as that seemed the most portable, but I stopped
short of adding comments to explain all the conversions.  IMO the
conversions are relatively obvious, and the units of the GUCs can be easily
seen in guc_tables.c.

> 4. How about simplifying WalRcvComputeNextWakeup() something like below?

Other than saving a couple lines of code, IMO this doesn't meaningfully
simplify the function or improve readability.

> 5. Can we move below code snippets to respective static functions for
> better readability and code reuse?
> This:
> +                /* Find the soonest wakeup time, to limit our nap. */
> +                nextWakeup = INT64_MAX;
> +                for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> +                    nextWakeup = Min(state.wakeup[i], nextWakeup);
> +                nap = Max(0, (nextWakeup - now + 999) / 1000);
> 
> And this:
> 
> +                    now = GetCurrentTimestamp();
> +                    for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> +                        WalRcvComputeNextWakeup(&state, i, now);

This did cross my mind, but I opted to avoid creating new functions because
1) they aren't likely to be reused very much, and 2) I actually think it
might hurt readability by forcing developers to traipse around the file to
figure out what these functions are actually doing.  It's not like it would
save many lines of code, either.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

10 October 2022, 18:10:14

On Mon, Oct 10, 2022 at 10:51:14AM -0700, Nathan Bossart wrote:
>> +                /* Find the soonest wakeup time, to limit our nap. */
>> +                nextWakeup = INT64_MAX;
>> +                for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
>> +                    nextWakeup = Min(state.wakeup[i], nextWakeup);
>> +                nap = Max(0, (nextWakeup - now + 999) / 1000);

Hm.  We should probably be more cautious here since nextWakeup is an int64
and nap is an int.  My guess is that we should just set nap to -1 if
nextWakeup > INT_MAX.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

11 October 2022, 00:50:59

On Mon, Oct 10, 2022 at 11:10:14AM -0700, Nathan Bossart wrote:
> On Mon, Oct 10, 2022 at 10:51:14AM -0700, Nathan Bossart wrote:
>>> +                /* Find the soonest wakeup time, to limit our nap. */
>>> +                nextWakeup = INT64_MAX;
>>> +                for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
>>> +                    nextWakeup = Min(state.wakeup[i], nextWakeup);
>>> +                nap = Max(0, (nextWakeup - now + 999) / 1000);
> 
> Hm.  We should probably be more cautious here since nextWakeup is an int64
> and nap is an int.  My guess is that we should just set nap to -1 if
> nextWakeup > INT_MAX.

Here's an attempt at fixing that.  I ended up just changing "nap" to an
int64 and then ensuring it's no larger than INT_MAX in the call to
WaitLatchOrSocket().  IIUC we can't use -1 here because WL_TIMEOUT is set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Suppressing useless wakeups in walreceiver

From

Bharath Rupireddy

Date:

11 October 2022, 01:31:26

On Mon, Oct 10, 2022 at 11:21 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>

Thanks for the updated patches.

> > 2. With the below change, the time walreceiver spends in
> > XLogWalRcvSendReply() is also included for XLogWalRcvSendHSFeedback
> > right? I think it's a problem given that XLogWalRcvSendReply() can
> > take a while. Earlier, this wasn't the case, each function calculating
> > 'now' separately. Any reason for changing this behaviour? I know that
> > GetCurrentTimestamp(); isn't cheaper all the time, but here it might
> > result in a change in the behaviour.
>
> Yes, if XLogWalRcvSendReply() takes a long time, we might defer sending the
> hot_standby_feedback message until a later time.  The only reason I changed
> this was to avoid extra calls to GetCurrentTimestamp(), which might be
> expensive on some platforms.

Agree. However...

> Outside of the code snippet you pointed out,
> I think WalReceiverMain() has a similar problem.  That being said, I'm
> generally skeptical that this sort of thing is detrimental given the
> current behavior (i.e., wake up every 100ms), the usual values of these
> GUCs (i.e., tens of seconds), and the fact that any tasks that are
> inappropriately skipped will typically be retried in the next iteration of
> the loop.

AFICS, the aim of the patch isn't optimizing around
GetCurrentTimestamp() calls and the patch does seem to change quite a
bit of them which may cause a change of the behaviour. I think that
the GetCurrentTimestamp() optimizations can go to 0003 patch in this
thread itself or we can discuss it as a separate thread to seek more
thoughts.

The 'now' in many instances in the patch may not actually represent
the true current time and it includes time taken by other operations
as well. I think this will have problems.

+            XLogWalRcvSendReply(state, now, false, false);
+            XLogWalRcvSendHSFeedback(state, now, false);

+                    now = GetCurrentTimestamp();
+                    for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+                        WalRcvComputeNextWakeup(&state, i, now);
+                    XLogWalRcvSendHSFeedback(&state, now, true);

+                    now = GetCurrentTimestamp();
+                    for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+                        WalRcvComputeNextWakeup(&state, i, now);
+                    XLogWalRcvSendHSFeedback(&state, now, true);

> > 4. How about simplifying WalRcvComputeNextWakeup() something like below?
>
> Other than saving a couple lines of code, IMO this doesn't meaningfully
> simplify the function or improve readability.

IMO, the duplicate lines of code aren't necessary in
WalRcvComputeNextWakeup() and that function's code isn't that complex
by removing them.

> > 5. Can we move below code snippets to respective static functions for
> > better readability and code reuse?
> > This:
> > +                /* Find the soonest wakeup time, to limit our nap. */
> > +                nextWakeup = INT64_MAX;
> > +                for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> > +                    nextWakeup = Min(state.wakeup[i], nextWakeup);
> > +                nap = Max(0, (nextWakeup - now + 999) / 1000);
> >
> > And this:
> >
> > +                    now = GetCurrentTimestamp();
> > +                    for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> > +                        WalRcvComputeNextWakeup(&state, i, now);
>
> This did cross my mind, but I opted to avoid creating new functions because
> 1) they aren't likely to be reused very much, and 2) I actually think it
> might hurt readability by forcing developers to traipse around the file to
> figure out what these functions are actually doing.  It's not like it would
> save many lines of code, either.

The second snippet is being used twice already. IMO having static
functions (perhaps inline) is much better here.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

11 October 2022, 02:50:31

On Tue, Oct 11, 2022 at 07:01:26AM +0530, Bharath Rupireddy wrote:
> On Mon, Oct 10, 2022 at 11:21 PM Nathan Bossart
>> Outside of the code snippet you pointed out,
>> I think WalReceiverMain() has a similar problem.  That being said, I'm
>> generally skeptical that this sort of thing is detrimental given the
>> current behavior (i.e., wake up every 100ms), the usual values of these
>> GUCs (i.e., tens of seconds), and the fact that any tasks that are
>> inappropriately skipped will typically be retried in the next iteration of
>> the loop.
> 
> AFICS, the aim of the patch isn't optimizing around
> GetCurrentTimestamp() calls and the patch does seem to change quite a
> bit of them which may cause a change of the behaviour. I think that
> the GetCurrentTimestamp() optimizations can go to 0003 patch in this
> thread itself or we can discuss it as a separate thread to seek more
> thoughts.
> 
> The 'now' in many instances in the patch may not actually represent
> the true current time and it includes time taken by other operations
> as well. I think this will have problems.

What problems do you think this will cause?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Bharath Rupireddy

Date:

11 October 2022, 04:04:25

On Tue, Oct 11, 2022 at 8:20 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> > AFICS, the aim of the patch isn't optimizing around
> > GetCurrentTimestamp() calls and the patch does seem to change quite a
> > bit of them which may cause a change of the behaviour. I think that
> > the GetCurrentTimestamp() optimizations can go to 0003 patch in this
> > thread itself or we can discuss it as a separate thread to seek more
> > thoughts.
> >
> > The 'now' in many instances in the patch may not actually represent
> > the true current time and it includes time taken by other operations
> > as well. I think this will have problems.
>
> What problems do you think this will cause?

now = t1;
XLogWalRcvSendReply() /* say it ran for a minute or so for whatever reasons */
XLogWalRcvSendHSFeedback() /* with patch walrecevier sends hot standby
feedback more often without properly honouring
wal_receiver_status_interval because the 'now' isn't actually the
current time as far as that function is concerned, it is
t1 + XLogWalRcvSendReply()'s time. */

Well, is this really a problem? I'm not sure about that. Let's hear from others.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

11 October 2022, 17:52:17

On Tue, Oct 11, 2022 at 09:34:25AM +0530, Bharath Rupireddy wrote:
> now = t1;
> XLogWalRcvSendReply() /* say it ran for a minute or so for whatever reasons */
> XLogWalRcvSendHSFeedback() /* with patch walrecevier sends hot standby
> feedback more often without properly honouring
> wal_receiver_status_interval because the 'now' isn't actually the
> current time as far as that function is concerned, it is
> t1 + XLogWalRcvSendReply()'s time. */
> 
> Well, is this really a problem? I'm not sure about that. Let's hear from others.

For this example, the feedback message would just be sent in the next loop
iteration instead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Bharath Rupireddy

Date:

13 October 2022, 10:05:14

On Tue, Oct 11, 2022 at 11:22 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Tue, Oct 11, 2022 at 09:34:25AM +0530, Bharath Rupireddy wrote:
> > now = t1;
> > XLogWalRcvSendReply() /* say it ran for a minute or so for whatever reasons */
> > XLogWalRcvSendHSFeedback() /* with patch walrecevier sends hot standby
> > feedback more often without properly honouring
> > wal_receiver_status_interval because the 'now' isn't actually the
> > current time as far as that function is concerned, it is
> > t1 + XLogWalRcvSendReply()'s time. */
> >
> > Well, is this really a problem? I'm not sure about that. Let's hear from others.
>
> For this example, the feedback message would just be sent in the next loop
> iteration instead.

I think the hot standby feedback message gets sent too frequently
without honouring the wal_receiver_status_interval because the 'now'
is actually not the current time with your patch but 'now +
XLogWalRcvSendReply()'s time'. However, it's possible that I may be
wrong here.

        /*
         * Send feedback at most once per wal_receiver_status_interval.
         */
        if (!TimestampDifferenceExceeds(sendTime, now,
                                        wal_receiver_status_interval * 1000))
            return;

As said upthread [1], I think the best way to move forward is to
separate the GetCurrentTimestamp() calls optimizations into 0003.

Do you have any further thoughts on review comments posted upthread [1]?

[1] https://www.postgresql.org/message-id/CALj2ACV5q2dbVRKwu2PL2s_YY0owZFTRxLdX%3Dt%2BdZ1iag15khA%40mail.gmail.com

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Alvaro Herrera

Date:

13 October 2022, 10:37:39

I think in 0001 we should put more stuff in the state struct --
specifically these globals:

static int    recvFile = -1;
static TimeLineID recvFileTLI = 0;
static XLogSegNo recvSegNo = 0;

The main reason is that it seems odd to have startpointTLI in the struct
used in some places together with a file-global recvFileTLI which isn't.
The way one is passed as argument and the other as part of a struct
seemed too distracting.  This should reduce the number of moving parts,
ISTM.


One thing that confused me for a moment is that we have some state in
walrcv and some more state in 'state'.  The difference is pretty obvious
once you look at the other, but it suggest to me that a better variable
name for the latter is 'localstate' to more obviously distinguish them.


I was tempted to suggest that LogstreamResult would also be good to have
in the new struct, but that might be going a bit too far for a first
cut.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

13 October 2022, 17:45:06

On Thu, Oct 13, 2022 at 03:35:14PM +0530, Bharath Rupireddy wrote:
> I think the hot standby feedback message gets sent too frequently
> without honouring the wal_receiver_status_interval because the 'now'
> is actually not the current time with your patch but 'now +
> XLogWalRcvSendReply()'s time'. However, it's possible that I may be
> wrong here.

Is your concern that the next wakeup time will be miscalculated because of
a stale value in 'now'?  I think that's existing behavior, as we save the
current time before sending the message in XLogWalRcvSendReply and
XLogWalRcvSendHSFeedback.  The only way to completely avoid this would be
to call GetCurrentTimestamp() after the message has been sent and to use
that value to calculate when the next message should be sent.  However, I
am skeptical there's really any practical impact.  Is it really a problem
if a message that should be sent every 30 seconds is sent at an interval of
29.9 seconds sometimes?  I think it could go the other way, too.  If 'now'
is stale, we might decide not to send a feedback message until the next
loop iteration, so the interval might be 30.1 seconds.  I don't think it's
realistic to expect complete accuracy here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

13 October 2022, 19:09:54

On Thu, Oct 13, 2022 at 12:37:39PM +0200, Alvaro Herrera wrote:
> I think in 0001 we should put more stuff in the state struct --
> specifically these globals:
> 
> static int    recvFile = -1;
> static TimeLineID recvFileTLI = 0;
> static XLogSegNo recvSegNo = 0;
> 
> The main reason is that it seems odd to have startpointTLI in the struct
> used in some places together with a file-global recvFileTLI which isn't.
> The way one is passed as argument and the other as part of a struct
> seemed too distracting.  This should reduce the number of moving parts,
> ISTM.

Makes sense.  Do you think the struct should be file-global so that it
doesn't need to be provided as an argument to most of the static functions
in this file?

> One thing that confused me for a moment is that we have some state in
> walrcv and some more state in 'state'.  The difference is pretty obvious
> once you look at the other, but it suggest to me that a better variable
> name for the latter is 'localstate' to more obviously distinguish them.

Sure, I'll change it to 'localstate'.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

16 October 2022, 03:59:00

On Thu, Oct 13, 2022 at 12:09:54PM -0700, Nathan Bossart wrote:
> On Thu, Oct 13, 2022 at 12:37:39PM +0200, Alvaro Herrera wrote:
>> The main reason is that it seems odd to have startpointTLI in the struct
>> used in some places together with a file-global recvFileTLI which isn't.
>> The way one is passed as argument and the other as part of a struct
>> seemed too distracting.  This should reduce the number of moving parts,
>> ISTM.
> 
> Makes sense.  Do you think the struct should be file-global so that it
> doesn't need to be provided as an argument to most of the static functions
> in this file?

Here's a different take.  Instead of creating structs and altering function
signatures, we can just make the wake-up times file-global, and we can skip
the changes related to reducing the number of calls to
GetCurrentTimestamp() for now.  This results in a far less invasive patch.
(I still think reducing the number of calls to GetCurrentTimestamp() is
worthwhile, but I'm beginning to agree with Bharath that it should be
handled separately.)

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

v7-0001-Suppress-useless-wakeups-in-walreceiver.patch

Re: Suppressing useless wakeups in walreceiver

From

Kyotaro Horiguchi

Date:

17 October 2022, 06:21:18

Thanks for taking this up.

At Sat, 15 Oct 2022 20:59:00 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Thu, Oct 13, 2022 at 12:09:54PM -0700, Nathan Bossart wrote:
> > On Thu, Oct 13, 2022 at 12:37:39PM +0200, Alvaro Herrera wrote:
> >> The main reason is that it seems odd to have startpointTLI in the struct
> >> used in some places together with a file-global recvFileTLI which isn't.
> >> The way one is passed as argument and the other as part of a struct
> >> seemed too distracting.  This should reduce the number of moving parts,
> >> ISTM.
> > 
> > Makes sense.  Do you think the struct should be file-global so that it
> > doesn't need to be provided as an argument to most of the static functions
> > in this file?
> 
> Here's a different take.  Instead of creating structs and altering function
> signatures, we can just make the wake-up times file-global, and we can skip

I'm fine with that. Rather it seems to me really simple.

> the changes related to reducing the number of calls to
> GetCurrentTimestamp() for now.  This results in a far less invasive patch.

Sure.

> (I still think reducing the number of calls to GetCurrentTimestamp() is
> worthwhile, but I'm beginning to agree with Bharath that it should be
> handled separately.)

+1.  We would gain lesser for the required effort.

> Thoughts?

Now that I see the fix for the implicit conversion:

L527: +                nap = Max(0, (nextWakeup - now + 999) / 1000);
..
L545: +                                       (int) Min(INT_MAX, nap),


I think limiting the naptime at use is confusing. Don't we place these
adjacently?  Then we could change the nap to an integer.  Or we can
just assert out for the nap time longer than INT_MAX (but this would
require another int64 variable. I belive we won't need such a long
nap, (or that is no longer a nap?)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Suppressing useless wakeups in walreceiver

From

Bharath Rupireddy

Date:

26 October 2022, 09:35:20

On Sun, Oct 16, 2022 at 9:29 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Here's a different take.  Instead of creating structs and altering function
> signatures, we can just make the wake-up times file-global, and we can skip
> the changes related to reducing the number of calls to
> GetCurrentTimestamp() for now.  This results in a far less invasive patch.
> (I still think reducing the number of calls to GetCurrentTimestamp() is
> worthwhile, but I'm beginning to agree with Bharath that it should be
> handled separately.)
>
> Thoughts?

Thanks. I'm wondering if we need to extend the similar approach for
logical replication workers in logical/worker.c LogicalRepApplyLoop()?

A comment on the patch:
Isn't it better we track the soonest wakeup time or min of wakeup[]
array (update the min whenever the array element is updated in
WalRcvComputeNextWakeup()) instead of recomputing everytime by looping
for NUM_WALRCV_WAKEUPS times? I think this will save us some CPU
cycles, because the for loop, in which the below code is placed, runs
till the walreceiver end of life cycle. We may wrap wakeup[] and min
inside a structure for better code organization or just add min as
another static variable alongside wakeup[].

+                /* Find the soonest wakeup time, to limit our nap. */
+                nextWakeup = PG_INT64_MAX;
+                for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
+                    nextWakeup = Min(wakeup[i], nextWakeup);
+                nap = Max(0, (nextWakeup - now + 999) / 1000);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

05 November 2022, 22:00:55

On Mon, Oct 17, 2022 at 03:21:18PM +0900, Kyotaro Horiguchi wrote:
> Now that I see the fix for the implicit conversion:
> 
> L527: +                nap = Max(0, (nextWakeup - now + 999) / 1000);
> ..
> L545: +                                       (int) Min(INT_MAX, nap),
> 
> 
> I think limiting the naptime at use is confusing. Don't we place these
> adjacently?  Then we could change the nap to an integer.  Or we can
> just assert out for the nap time longer than INT_MAX (but this would
> require another int64 variable. I belive we won't need such a long
> nap, (or that is no longer a nap?)

Yeah, I guess this deserves a comment.  I could also combine it easily:

    nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));

We could probably just remove the WL_TIMEOUT flag and set timeout to -1
whenever "nap" is calculated to be > INT_MAX, but I didn't think it was
worth complicating the code in order to save an extra wakeup every ~25
days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

05 November 2022, 22:11:40

On Wed, Oct 26, 2022 at 03:05:20PM +0530, Bharath Rupireddy wrote:
> A comment on the patch:
> Isn't it better we track the soonest wakeup time or min of wakeup[]
> array (update the min whenever the array element is updated in
> WalRcvComputeNextWakeup()) instead of recomputing everytime by looping
> for NUM_WALRCV_WAKEUPS times? I think this will save us some CPU
> cycles, because the for loop, in which the below code is placed, runs
> till the walreceiver end of life cycle. We may wrap wakeup[] and min
> inside a structure for better code organization or just add min as
> another static variable alongside wakeup[].
> 
> +                /* Find the soonest wakeup time, to limit our nap. */
> +                nextWakeup = PG_INT64_MAX;
> +                for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
> +                    nextWakeup = Min(wakeup[i], nextWakeup);
> +                nap = Max(0, (nextWakeup - now + 999) / 1000);

While that might save a few CPU cycles when computing the nap time, I don't
think it's worth the added complexity and CPU cycles in
WalRcvComputeNextWakeup().  I suspect it'd be difficult to demonstrate any
meaningful difference between the two approaches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

05 November 2022, 23:01:15

On Sat, Nov 05, 2022 at 03:00:55PM -0700, Nathan Bossart wrote:
> On Mon, Oct 17, 2022 at 03:21:18PM +0900, Kyotaro Horiguchi wrote:
>> Now that I see the fix for the implicit conversion:
>> 
>> L527: +                nap = Max(0, (nextWakeup - now + 999) / 1000);
>> ..
>> L545: +                                       (int) Min(INT_MAX, nap),
>> 
>> 
>> I think limiting the naptime at use is confusing. Don't we place these
>> adjacently?  Then we could change the nap to an integer.  Or we can
>> just assert out for the nap time longer than INT_MAX (but this would
>> require another int64 variable. I belive we won't need such a long
>> nap, (or that is no longer a nap?)
> 
> Yeah, I guess this deserves a comment.  I could also combine it easily:
> 
>     nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));

Here is a new version of the patch that addresses this feedback.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

v8-0001-Suppress-useless-wakeups-in-walreceiver.patch

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

08 November 2022, 02:20:22

On Sun, Nov 6, 2022 at 12:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Here is a new version of the patch that addresses this feedback.

This looks pretty good to me.  Thanks for picking it up!  I can live
with the change to use a global variable; it seems we couldn't quite
decide what the scope was for moving stuff into a struct (ie various
pre-existing stuff representing the walreceiver's state), but I'm OK
with considering that as a pure refactoring project for a separate
thread.  That array should be "static", though.

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

08 November 2022, 07:46:26

On Tue, Nov 8, 2022 at 3:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Nov 6, 2022 at 12:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> > Here is a new version of the patch that addresses this feedback.
>
> This looks pretty good to me.  Thanks for picking it up!  I can live
> with the change to use a global variable; it seems we couldn't quite
> decide what the scope was for moving stuff into a struct (ie various
> pre-existing stuff representing the walreceiver's state), but I'm OK
> with considering that as a pure refactoring project for a separate
> thread.  That array should be "static", though.

And with that change and a pgindent, pushed.

Re: Suppressing useless wakeups in walreceiver

From

Bharath Rupireddy

Date:

08 November 2022, 08:20:25

On Tue, Nov 8, 2022 at 1:17 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> And with that change and a pgindent, pushed.

Thanks. Do we need a similar wakeup approach for logical replication
workers in worker.c? Or is it okay that the nap time is 1sec there?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

08 November 2022, 08:45:40

On Tue, Nov 8, 2022 at 9:20 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Thanks. Do we need a similar wakeup approach for logical replication
> workers in worker.c? Or is it okay that the nap time is 1sec there?

Yeah, I think so.  At least for its idle/nap case (waiting for flush
is also a technically interesting case, but harder, and applies to
non-idle systems so the polling is a little less offensive).

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

08 November 2022, 21:06:39

On Tue, Nov 08, 2022 at 08:46:26PM +1300, Thomas Munro wrote:
> On Tue, Nov 8, 2022 at 3:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> This looks pretty good to me.  Thanks for picking it up!  I can live
>> with the change to use a global variable; it seems we couldn't quite
>> decide what the scope was for moving stuff into a struct (ie various
>> pre-existing stuff representing the walreceiver's state), but I'm OK
>> with considering that as a pure refactoring project for a separate
>> thread.  That array should be "static", though.
> 
> And with that change and a pgindent, pushed.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

08 November 2022, 21:08:26

On Tue, Nov 08, 2022 at 09:45:40PM +1300, Thomas Munro wrote:
> On Tue, Nov 8, 2022 at 9:20 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>> Thanks. Do we need a similar wakeup approach for logical replication
>> workers in worker.c? Or is it okay that the nap time is 1sec there?
> 
> Yeah, I think so.  At least for its idle/nap case (waiting for flush
> is also a technically interesting case, but harder, and applies to
> non-idle systems so the polling is a little less offensive).

Bharath, are you planning to pick this up?  If not, I can probably spend
some time on it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Bharath Rupireddy

Date:

09 November 2022, 04:52:37

On Wed, Nov 9, 2022 at 2:38 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Nov 08, 2022 at 09:45:40PM +1300, Thomas Munro wrote:
> > On Tue, Nov 8, 2022 at 9:20 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >> Thanks. Do we need a similar wakeup approach for logical replication
> >> workers in worker.c? Or is it okay that the nap time is 1sec there?
> >
> > Yeah, I think so.  At least for its idle/nap case (waiting for flush
> > is also a technically interesting case, but harder, and applies to
> > non-idle systems so the polling is a little less offensive).
>
> Bharath, are you planning to pick this up?  If not, I can probably spend
> some time on it.

Please go ahead!

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Tom Lane

Date:

13 November 2022, 22:08:04

Thomas Munro <thomas.munro@gmail.com> writes:
> And with that change and a pgindent, pushed.

There is something very seriously wrong with this patch.

On my machine, running "make -j10 check-world" (with compilation
already done) has been taking right about 2 minutes for some time.
Since this patch, it's taking around 2:45 --- I did a bisect run
to confirm that this patch is where it changed.

The buildfarm is showing a hit, too.  Comparing the step runtimes at

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2022-11-08%2005%3A29%3A28
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2022-11-08%2007%3A49%3A31

it'd seem that most tests involving walreceivers got much slower:
pg_basebackup-check from 00:29 to 00:39,
pg_rewind-check went from 00:56 to 01:26,
and recovery-check went from 03:56 to 04:45.
Curiously, subscription-check only went from 03:26 to 03:29.

I've not dug into it further than that, but my bet is that some
required wakeup condition was not accounted for.

            regards, tom lane

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

13 November 2022, 22:18:13

On Mon, Nov 14, 2022 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is something very seriously wrong with this patch.
>
> On my machine, running "make -j10 check-world" (with compilation
> already done) has been taking right about 2 minutes for some time.
> Since this patch, it's taking around 2:45 --- I did a bisect run
> to confirm that this patch is where it changed.
>
> The buildfarm is showing a hit, too.  Comparing the step runtimes at
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2022-11-08%2005%3A29%3A28
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2022-11-08%2007%3A49%3A31
>
> it'd seem that most tests involving walreceivers got much slower:
> pg_basebackup-check from 00:29 to 00:39,
> pg_rewind-check went from 00:56 to 01:26,
> and recovery-check went from 03:56 to 04:45.
> Curiously, subscription-check only went from 03:26 to 03:29.
>
> I've not dug into it further than that, but my bet is that some
> required wakeup condition was not accounted for.

Looking...

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

13 November 2022, 22:26:44

On Sun, Nov 13, 2022 at 05:08:04PM -0500, Tom Lane wrote:
> There is something very seriously wrong with this patch.
> 
> On my machine, running "make -j10 check-world" (with compilation
> already done) has been taking right about 2 minutes for some time.
> Since this patch, it's taking around 2:45 --- I did a bisect run
> to confirm that this patch is where it changed.

I've been looking into this.  I wrote a similar patch for logical/worker.c
before noticing that check-world was taking much longer.  The problem in
that case seems to be that process_syncing_tables() isn't called as often.
It wouldn't surprise me if there's also something in walreceiver.c that
depends upon the frequent wakeups.  I suspect this will require a revert.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

13 November 2022, 23:11:19

On Mon, Nov 14, 2022 at 11:26 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> On Sun, Nov 13, 2022 at 05:08:04PM -0500, Tom Lane wrote:
> > There is something very seriously wrong with this patch.
> >
> > On my machine, running "make -j10 check-world" (with compilation
> > already done) has been taking right about 2 minutes for some time.
> > Since this patch, it's taking around 2:45 --- I did a bisect run
> > to confirm that this patch is where it changed.
>
> I've been looking into this.  I wrote a similar patch for logical/worker.c
> before noticing that check-world was taking much longer.  The problem in
> that case seems to be that process_syncing_tables() isn't called as often.
> It wouldn't surprise me if there's also something in walreceiver.c that
> depends upon the frequent wakeups.  I suspect this will require a revert.

In the case of "meson test pg_basebackup/020_pg_receivewal" it looks
like wait_for_catchup hangs around for 10 seconds waiting for HS
feedback.  I'm wondering if we need to go back to high frequency
wakeups until it's caught up, or (probably better) arrange for a
proper event for progress.  Digging...

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

13 November 2022, 23:35:28

On Mon, Nov 14, 2022 at 12:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Nov 14, 2022 at 11:26 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
> > On Sun, Nov 13, 2022 at 05:08:04PM -0500, Tom Lane wrote:
> > > There is something very seriously wrong with this patch.
> > >
> > > On my machine, running "make -j10 check-world" (with compilation
> > > already done) has been taking right about 2 minutes for some time.
> > > Since this patch, it's taking around 2:45 --- I did a bisect run
> > > to confirm that this patch is where it changed.
> >
> > I've been looking into this.  I wrote a similar patch for logical/worker.c
> > before noticing that check-world was taking much longer.  The problem in
> > that case seems to be that process_syncing_tables() isn't called as often.
> > It wouldn't surprise me if there's also something in walreceiver.c that
> > depends upon the frequent wakeups.  I suspect this will require a revert.
>
> In the case of "meson test pg_basebackup/020_pg_receivewal" it looks
> like wait_for_catchup hangs around for 10 seconds waiting for HS
> feedback.  I'm wondering if we need to go back to high frequency
> wakeups until it's caught up, or (probably better) arrange for a
> proper event for progress.  Digging...

Maybe there is a better way to code this (I mean, who likes global
variables?) and I need to test some more, but I suspect the attached
is approximately what we missed.

Attachment

more-replies.patch

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

13 November 2022, 23:51:14

On Mon, Nov 14, 2022 at 12:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Maybe there is a better way to code this (I mean, who likes global
> variables?) and I need to test some more, but I suspect the attached
> is approximately what we missed.

Erm, ENOCOFFEE, sorry that's not quite right, it needs the apply LSN,
but it demonstrates what the problem is...

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

14 November 2022, 00:05:31

On Mon, Nov 14, 2022 at 12:51:14PM +1300, Thomas Munro wrote:
> On Mon, Nov 14, 2022 at 12:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Maybe there is a better way to code this (I mean, who likes global
>> variables?) and I need to test some more, but I suspect the attached
>> is approximately what we missed.
> 
> Erm, ENOCOFFEE, sorry that's not quite right, it needs the apply LSN,
> but it demonstrates what the problem is...

Yeah, I'm able to sort-of reproduce the problem by sending fewer replies,
but it's not clear to me why that's the problem.  AFAICT all of the code
paths that write/flush are careful to send a reply shortly afterward, which
should keep writePtr/flushPtr updated.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

14 November 2022, 01:47:14

On Mon, Nov 14, 2022 at 1:05 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Yeah, I'm able to sort-of reproduce the problem by sending fewer replies,
> but it's not clear to me why that's the problem.  AFAICT all of the code
> paths that write/flush are careful to send a reply shortly afterward, which
> should keep writePtr/flushPtr updated.

Ahh, I think I might have it.  In the old coding, sendTime starts out
as 0, which I think caused it to send its first reply message after
the first 100ms sleep, and only after that fall into a cadence of
wal_receiver_status_interval (10s) while idle.  Our new coding won't
send its first reply until start time + wal_receiver_status_interval.
If I have that right, think we can get back to the previous behaviour
by explicitly setting the first message time, like:

@@ -433,6 +433,9 @@ WalReceiverMain(void)
            for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
                WalRcvComputeNextWakeup(i, now);

+           /* XXX start with a reply after 100ms */
+           wakeup[WALRCV_WAKEUP_REPLY] = now + 100000;
+
            /* Loop until end-of-streaming or error */

Obviously that's bogus and racy (it races with wait_for_catchup, and
it's slow, actually both sides are not great and really should be
event-driven, in later work)...

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

14 November 2022, 19:01:27

On Mon, Nov 14, 2022 at 02:47:14PM +1300, Thomas Munro wrote:
> Ahh, I think I might have it.  In the old coding, sendTime starts out
> as 0, which I think caused it to send its first reply message after
> the first 100ms sleep, and only after that fall into a cadence of
> wal_receiver_status_interval (10s) while idle.  Our new coding won't
> send its first reply until start time + wal_receiver_status_interval.
> If I have that right, think we can get back to the previous behaviour
> by explicitly setting the first message time, like:

Good find!

> @@ -433,6 +433,9 @@ WalReceiverMain(void)
>             for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
>                 WalRcvComputeNextWakeup(i, now);
> 
> +           /* XXX start with a reply after 100ms */
> +           wakeup[WALRCV_WAKEUP_REPLY] = now + 100000;
> +
>             /* Loop until end-of-streaming or error */
> 
> Obviously that's bogus and racy (it races with wait_for_catchup, and
> it's slow, actually both sides are not great and really should be
> event-driven, in later work)...

Is there any reason we should wait for 100ms before sending the initial
reply?  ISTM the previous behavior essentially caused the first reply (and
feedback message) to be sent at the first opportunity after streaming
begins.  My instinct is to do something like the attached.  I wonder if we
ought to do something similar in the ConfigReloadPending path in case
hot_standby_feedback is being turned on.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

send_initial_reply.patch

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

14 November 2022, 20:42:26

On Tue, Nov 15, 2022 at 8:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Is there any reason we should wait for 100ms before sending the initial
> reply?  ISTM the previous behavior essentially caused the first reply (and
> feedback message) to be sent at the first opportunity after streaming
> begins.  My instinct is to do something like the attached.  I wonder if we
> ought to do something similar in the ConfigReloadPending path in case
> hot_standby_feedback is being turned on.

That works for 020_pg_receivewal.  I wonder if there are also tests
that stream a bit of WAL first and then do wait_for_catchup that were
previously benefiting from the 100ms-after-startup message by
scheduling luck (as in, that was usually enough for replay)?  I might
go and teach Cluster.pm to log how much time is wasted in
wait_for_catchup to get some observability, and then try to figure out
how to optimise it properly.  We could perhaps put the 100ms duct tape
back temporarily though, if necessary.

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

14 November 2022, 23:14:39

On Tue, Nov 15, 2022 at 09:42:26AM +1300, Thomas Munro wrote:
> That works for 020_pg_receivewal.  I wonder if there are also tests
> that stream a bit of WAL first and then do wait_for_catchup that were
> previously benefiting from the 100ms-after-startup message by
> scheduling luck (as in, that was usually enough for replay)?  I might
> go and teach Cluster.pm to log how much time is wasted in
> wait_for_catchup to get some observability, and then try to figure out
> how to optimise it properly.  We could perhaps put the 100ms duct tape
> back temporarily though, if necessary.

Oh, I see.  Since we don't check the apply position when determining
whether to send a reply, tests may need to wait a full
wal_receiver_status_interval.  FWIW with my patch, the runtime of the
src/test/recovery tests seems to be back to what it was on my machine, but
I certainly wouldn't rule out scheduling luck.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

14 November 2022, 23:26:00

On Tue, Nov 15, 2022 at 12:14 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> On Tue, Nov 15, 2022 at 09:42:26AM +1300, Thomas Munro wrote:
> > That works for 020_pg_receivewal.  I wonder if there are also tests
> > that stream a bit of WAL first and then do wait_for_catchup that were
> > previously benefiting from the 100ms-after-startup message by
> > scheduling luck (as in, that was usually enough for replay)?  I might
> > go and teach Cluster.pm to log how much time is wasted in
> > wait_for_catchup to get some observability, and then try to figure out
> > how to optimise it properly.  We could perhaps put the 100ms duct tape
> > back temporarily though, if necessary.
>
> Oh, I see.  Since we don't check the apply position when determining
> whether to send a reply, tests may need to wait a full
> wal_receiver_status_interval.  FWIW with my patch, the runtime of the
> src/test/recovery tests seems to be back to what it was on my machine, but
> I certainly wouldn't rule out scheduling luck.

Yeah, what you have looks good in my experiments.  Here are some
statistics on waits on my system.  Times in seconds, first row is
before the new time logic went in, your patch is the last row.

     name      | count |   sum   |  avg  | stddev |  min  |  max
---------------+-------+---------+-------+--------+-------+--------
 before        |   718 |  37.375 | 0.052 |  0.135 | 0.006 |  1.110
 master        |   718 | 173.100 | 0.241 |  1.374 | 0.006 | 10.004
 initial-100ms |   718 |  37.169 | 0.052 |  0.126 | 0.006 |  0.676
 initial-0ms   |   718 |  35.276 | 0.049 |  0.123 | 0.006 |  0.661

The difference on master is explained by these 14 outliers:

                   name                   |  time
------------------------------------------+--------
 testrun/recovery/019_replslot_limit      | 10.004
 testrun/recovery/033_replay_tsp_drops    |  9.917
 testrun/recovery/033_replay_tsp_drops    |  9.957
 testrun/pg_basebackup/020_pg_receivewal  |  9.899
 testrun/pg_rewind/003_extrafiles         |  9.961
 testrun/pg_rewind/003_extrafiles         |  9.916
 testrun/pg_rewind/008_min_recovery_point |  9.929
 recovery/019_replslot_limit              | 10.004
 recovery/033_replay_tsp_drops            |  9.917
 recovery/033_replay_tsp_drops            |  9.957
 pg_basebackup/020_pg_receivewal          |  9.899
 pg_rewind/003_extrafiles                 |  9.961
 pg_rewind/003_extrafiles                 |  9.916
 pg_rewind/008_min_recovery_point         |  9.929
(14 rows)

Now that I can see what is going on I'm going to try to see how much I
can squeeze out of this...

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

15 November 2022, 04:49:16

Another option might be to just force initial reply/feedback messages when
streaming starts.  The attached patch improves src/test/recovery test
runtime just like the previous one I posted.

On Mon, Nov 14, 2022 at 11:01:27AM -0800, Nathan Bossart wrote:
> I wonder if we
> ought to do something similar in the ConfigReloadPending path in case
> hot_standby_feedback is being turned on.

Looking closer, I see that a feedback message is forced in this case
already.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

send_initial_reply_v2.patch

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

16 November 2022, 03:57:08

On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Another option might be to just force initial reply/feedback messages when
> streaming starts.  The attached patch improves src/test/recovery test
> runtime just like the previous one I posted.

Yeah, looks good in my tests.  I think we just need to make it
conditional so we don't force it if someone has
wal_receiver_status_interval disabled.

Attachment

0001-Fix-slowdown-in-TAP-tests-due-to-recent-walreceiver-.patch

Re: Suppressing useless wakeups in walreceiver

From

Nathan Bossart

Date:

16 November 2022, 04:24:16

On Wed, Nov 16, 2022 at 04:57:08PM +1300, Thomas Munro wrote:
> On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Another option might be to just force initial reply/feedback messages when
>> streaming starts.  The attached patch improves src/test/recovery test
>> runtime just like the previous one I posted.
> 
> Yeah, looks good in my tests.  I think we just need to make it
> conditional so we don't force it if someone has
> wal_receiver_status_interval disabled.

Yeah, that makes sense.  IIUC setting "force" to false would have the same
effect for the initial round of streaming, but since writePtr/flushPtr will
be set in later rounds, no reply would be guaranteed.  That might be good
enough for the tests, but it seems a bit inconsistent.  So, your patch is
probably the way to go.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Suppressing useless wakeups in walreceiver

From

Thomas Munro

Date:

16 November 2022, 22:31:57

On Wed, Nov 16, 2022 at 5:24 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Wed, Nov 16, 2022 at 04:57:08PM +1300, Thomas Munro wrote:
> > On Tue, Nov 15, 2022 at 5:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> Another option might be to just force initial reply/feedback messages when
> >> streaming starts.  The attached patch improves src/test/recovery test
> >> runtime just like the previous one I posted.
> >
> > Yeah, looks good in my tests.  I think we just need to make it
> > conditional so we don't force it if someone has
> > wal_receiver_status_interval disabled.
>
> Yeah, that makes sense.  IIUC setting "force" to false would have the same
> effect for the initial round of streaming, but since writePtr/flushPtr will
> be set in later rounds, no reply would be guaranteed.  That might be good
> enough for the tests, but it seems a bit inconsistent.  So, your patch is
> probably the way to go.

On second thoughts, there's not much point in that, since we always
force replies under various other circumstances anyway.  There isn't
really a 'never send replies' mode.  Committed the way you had it
before.  Thanks!

I can't unsee that we're spending ~35 seconds waiting for catchup in
718 separate wait_for_catchup calls.  The problem is entirely on the
waiting side (we already do WalRcvForceReply() in xlogrecovery.c when
going idle), and there's probably a nice condition variable-based
improvement here but I decided to hop over that rabbit hole today.