Thread: suppressing useless wakeups in logical/worker.c

suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
Hi hackers,

I've attached an attempt at porting a change similar to 05a7be9 [0] to
logical/worker.c.  The bulk of the patch is lifted from the walreceiver
patch, but I did need to add a hack for waking up after
wal_retrieve_retry_interval to start sync workers.  This hack involves a
new wakeup variable that process_syncing_tables_for_apply() sets.
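
For a rough picture of the mechanism, the sketch below shows the shape of the
change; it is illustrative only.  The identifiers mirror ones that come up
later in this thread (wakeup[], NUM_LRW_WAKEUPS, next_sync_start), but the
details differ from the actual patch.

```
/* One entry per reason the apply worker might need to wake up. */
static TimestampTz wakeup[NUM_LRW_WAKEUPS];

/* Separate wakeup time for starting tablesync workers (the "hack"). */
static TimestampTz next_sync_start;

/*
 * In LogicalRepApplyLoop() (declarations of now, rc, fd omitted): instead of
 * a fixed 1s nap, sleep until the earliest pending wakeup time.
 */
TimestampTz nextWakeup = PG_INT64_MAX;
long        wait_time;

for (int i = 0; i < NUM_LRW_WAKEUPS; i++)
    nextWakeup = Min(wakeup[i], nextWakeup);
nextWakeup = Min(next_sync_start, nextWakeup);

/* Clamp to WaitLatch()'s range, rounding fractional milliseconds up. */
now = GetCurrentTimestamp();
wait_time = (long) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));

rc = WaitLatchOrSocket(MyLatch,
                       WL_SOCKET_READABLE | WL_LATCH_SET | WL_TIMEOUT |
                       WL_EXIT_ON_PM_DEATH,
                       fd, wait_time, WAIT_EVENT_LOGICAL_APPLY_MAIN);
```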

For best results, this patch should be applied on top of [1], which is an
attempt at fixing all the things that currently only happen in a timely
fashion because logical replication worker processes wake up at least once
a second.  With the attached patch applied, those periodic wakeups are
gone, so we need to make sure we wake up the logical workers as needed.

[0] https://postgr.es/m/CA%2BhUKGJGhX4r2LPUE3Oy9BX71Eum6PBcS8L3sJpScR9oKaTVaA%40mail.gmail.com
[1] https://postgr.es/m/20221122004119.GA132961%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

RE: suppressing useless wakeups in logical/worker.c

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Nathan,

Thank you for the patch!  I tested it, and it basically worked well.
About the following part:

```
            ConfigReloadPending = false;
             ProcessConfigFile(PGC_SIGHUP);
+            now = GetCurrentTimestamp();
+            for (int i = 0; i < NUM_LRW_WAKEUPS; i++)
+                LogRepWorkerComputeNextWakeup(i, now);
+
+            /*
+             * If a wakeup time for starting sync workers was set, just set it
+             * to right now.  It will be recalculated as needed.
+             */
+            if (next_sync_start != PG_INT64_MAX)
+                next_sync_start = now;
         }
```

Do we have to recalculate the next wakeup times when the subscriber receives a
SIGHUP signal?  I think this may cause unexpected behavior like the following.

Assume that wal_receiver_timeout is 60s, wal_sender_timeout on the publisher is
0s (or the network between the nodes is disconnected), and that we send SIGHUP
to the subscriber's postmaster every 20s.

Currently, last_recv_time is calculated when the worker receives messages, and
the value is used to decide whether to send a ping.  The worker will exit if
the walsender does not reply.

But with your patch, the apply worker recalculates wakeup[LRW_WAKEUP_PING] and
wakeup[LRW_WAKEUP_TERMINATE] when it gets SIGHUP, so the worker never sends a
ping with requestReply = true and never exits due to the timeout.

My scenario may seem contrived, but there may be other issues if this behavior
remains.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
On Mon, Dec 05, 2022 at 01:00:19PM +0000, Hayato Kuroda (Fujitsu) wrote:
> But with your patch, the apply worker recalculates wakeup[LRW_WAKEUP_PING] and
> wakeup[LRW_WAKEUP_TERMINATE] when it gets SIGHUP, so the worker never sends a
> ping with requestReply = true and never exits due to the timeout.

This is the case for the walreceiver patch, too.  If a SIGHUP arrives just
before we are due to ping the server, the ping wakeup time will be pushed
back.  To me, this seems unlikely to cause any issues in practice unless
the server is being constantly SIGHUP'd.  If we wanted to fix this, we'd
probably need to recompute the wakeup times using the values currently set.
I haven't looked into this too closely, but it doesn't sound tremendously
difficult.  Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
rebased for cfbot

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: suppressing useless wakeups in logical/worker.c

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> [ v2-0001-suppress-unnecessary-wakeups-in-logical-worker.c.patch ]

I took a look through this, and have a number of mostly-cosmetic
issues:

* It seems wrong that next_sync_start isn't handled as one of the
wakeup[NUM_LRW_WAKEUPS] entries.  I see that it needs to be accessed from
another module; but you could handle that without exposing either enum
LogRepWorkerWakeupReason or the array, by providing getter/setter
functions for process_syncing_tables_for_apply() to call.

* This code is far too intimately familiar with the fact that TimestampTz
is an int64 count of microseconds.  (I'm picky about that because I
remember that they were once something else, so I wonder if someday
they will be different again.)  You could get rid of the PG_INT64_MAX
usages by replacing those with the timestamp infinity macro DT_NOEND;
and I'd even be on board with adding a less-opaque alternate name for
that to datatype/timestamp.h.  The various magic-constant multipliers
could perhaps be made less magic by using TimestampTzPlusMilliseconds().

* I think it might be better to construct the enum like this:

+typedef enum LogRepWorkerWakeupReason
+{
+    LRW_WAKEUP_TERMINATE,
+    LRW_WAKEUP_PING,
+    LRW_WAKEUP_STATUS
+#define NUM_LRW_WAKEUPS (LRW_WAKEUP_STATUS + 1)
+} LogRepWorkerWakeupReason;

so that you don't have to have a default: case in switches on the
enum value.  I'm more worried about somebody adding an enum value
and missing updating a switch statement elsewhere than I am about
somebody adding an enum value and neglecting to update the
immediately-adjacent macro.

* The updates of "now" in LogicalRepApplyLoop seem rather
randomly placed, and I'm not entirely convinced that we'll
always be using a reasonably up-to-date value.  Can't we
just update it right before each usage?

* This special handling of next_sync_start seems mighty ugly:

+            /* Also consider special wakeup time for starting sync workers. */
+            if (next_sync_start < now)
+            {
+                /*
+                 * Instead of spinning while we wait for the sync worker to
+                 * start, wait a bit before retrying (unless there's an earlier
+                 * wakeup time).
+                 */
+                nextWakeup = Min(now + INT64CONST(100000), nextWakeup);
+            }
+            else
+                nextWakeup = Min(next_sync_start, nextWakeup);

Do we really need the slop?  If so, is there a reason why it
shouldn't apply to all possible sources of nextWakeup?  (It's
going to be hard to fold next_sync_start into the wakeup[]
array unless you can make this not a special case.)

* It'd probably be worth enlarging the comment for
LogRepWorkerComputeNextWakeup to explain why its API is like that,
perhaps "We ask the caller to pass in the value of "now" because
this frequently avoids multiple calls of GetCurrentTimestamp().
It had better be a reasonably-up-to-date value, though."

            regards, tom lane



Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
On Tue, Jan 24, 2023 at 06:45:08PM -0500, Tom Lane wrote:
> I took a look through this, and have a number of mostly-cosmetic
> issues:

Thanks for the detailed review.

> * It seems wrong that next_sync_start isn't handled as one of the
> wakeup[NUM_LRW_WAKEUPS] entries.  I see that it needs to be accessed from
> another module; but you could handle that without exposing either enum
> LogRepWorkerWakeupReason or the array, by providing getter/setter
> functions for process_syncing_tables_for_apply() to call.
> 
> * This code is far too intimately familiar with the fact that TimestampTz
> is an int64 count of microseconds.  (I'm picky about that because I
> remember that they were once something else, so I wonder if someday
> they will be different again.)  You could get rid of the PG_INT64_MAX
> usages by replacing those with the timestamp infinity macro DT_NOEND;
> and I'd even be on board with adding a less-opaque alternate name for
> that to datatype/timestamp.h.  The various magic-constant multipliers
> could perhaps be made less magic by using TimestampTzPlusMilliseconds(). 
> 
> * I think it might be better to construct the enum like this:
> 
> +typedef enum LogRepWorkerWakeupReason
> +{
> +    LRW_WAKEUP_TERMINATE,
> +    LRW_WAKEUP_PING,
> +    LRW_WAKEUP_STATUS
> +#define NUM_LRW_WAKEUPS (LRW_WAKEUP_STATUS + 1)
> +} LogRepWorkerWakeupReason;
> 
> so that you don't have to have a default: case in switches on the
> enum value.  I'm more worried about somebody adding an enum value
> and missing updating a switch statement elsewhere than I am about 
> somebody adding an enum value and neglecting to update the
> immediately-adjacent macro.

I did all of this in v3.
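
For readers following along, that kind of change might look roughly like this
(a hypothetical sketch; the names track the later discussion in this thread,
and the actual v3 code may differ):

```
/* worker.c: keep the array private to this module. */
static TimestampTz wakeup[NUM_LRW_WAKEUPS];

/* Setter so that tablesync.c need not see the enum or the array. */
void
LogRepWorkerUpdateSyncStartWakeup(TimestampTz wakeup_time)
{
    wakeup[LRW_WAKEUP_SYNC_START] = wakeup_time;
}

/*
 * Deadlines computed via TimestampTzPlusMilliseconds() instead of bare
 * microsecond arithmetic, e.g. for the termination timeout.
 */
static void
LogRepWorkerComputeNextWakeup(LogRepWorkerWakeupReason reason, TimestampTz now)
{
    switch (reason)
    {
        case LRW_WAKEUP_TERMINATE:
            if (wal_receiver_timeout <= 0)
                wakeup[reason] = DT_NOEND;
            else
                wakeup[reason] = TimestampTzPlusMilliseconds(now,
                                                             wal_receiver_timeout);
            break;
            /* ... and similarly for the other reasons ... */
    }
}
```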

> * The updates of "now" in LogicalRepApplyLoop seem rather
> randomly placed, and I'm not entirely convinced that we'll
> always be using a reasonably up-to-date value.  Can't we
> just update it right before each usage?

This came up for walreceiver.c, too.  The concern is that
GetCurrentTimestamp() might be rather expensive on systems without
something like the vDSO.  I don't know how common that is.  If we can rule
that out, then I agree that we should just update it right before each use.

> * This special handling of next_sync_start seems mighty ugly:
> 
> +            /* Also consider special wakeup time for starting sync workers. */
> +            if (next_sync_start < now)
> +            {
> +                /*
> +                 * Instead of spinning while we wait for the sync worker to
> +                 * start, wait a bit before retrying (unless there's an earlier
> +                 * wakeup time).
> +                 */
> +                nextWakeup = Min(now + INT64CONST(100000), nextWakeup);
> +            }
> +            else
> +                nextWakeup = Min(next_sync_start, nextWakeup);
> 
> Do we really need the slop?  If so, is there a reason why it
> shouldn't apply to all possible sources of nextWakeup?  (It's
> going to be hard to fold next_sync_start into the wakeup[]
> array unless you can make this not a special case.)

I'm not positive it is absolutely necessary.  AFAICT the function that
updates this particular wakeup time is conditionally called, so it at least
seems theoretically possible that we could end up spinning in a tight loop
until we attempt to start a new tablesync worker.  But perhaps this is
unlikely enough that we needn't worry about it.

I noticed that this wakeup time wasn't being updated when the number of
active tablesync workers is >= max_sync_workers_per_subscription.  In v3, I
tried to handle this by setting the wakeup time to a second later for this
case.  I think you could ordinarily depend on the tablesync worker's
notify_pid to wake up the apply worker, but that wouldn't work if the apply
worker has restarted.

Ultimately, this particular wakeup time seems to be a special case, and I
probably need to think about it some more.  If you have ideas, I'm all
ears.

> * It'd probably be worth enlarging the comment for
> LogRepWorkerComputeNextWakeup to explain why its API is like that,
> perhaps "We ask the caller to pass in the value of "now" because
> this frequently avoids multiple calls of GetCurrentTimestamp().
> It had better be a reasonably-up-to-date value, though."

I did this in v3.  I noticed that many of your comments also applied to the
similar patch that was recently applied to walreceiver.c, so I created
another patch to fix that up.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: suppressing useless wakeups in logical/worker.c

From
Thomas Munro
Date:
On Thu, Jan 26, 2023 at 12:50 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> I did this in v3.  I noticed that many of your comments also applied to the
> similar patch that was recently applied to walreceiver.c, so I created
> another patch to fix that up.

Can we also use TimestampDifferenceMilliseconds()?  It knows about
rounding up for WaitLatch().
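
(That is, it rounds any fractional millisecond up so the caller doesn't wake
just before the deadline and spin; the key line in timestamp.c is essentially:)

```
return (long) ((diff + 999) / 1000);    /* diff is in microseconds */
```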



Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
On Thu, Jan 26, 2023 at 01:23:41PM +1300, Thomas Munro wrote:
> Can we also use TimestampDifferenceMilliseconds()?  It knows about
> rounding up for WaitLatch().

I think we might risk overflowing "long" when all the wakeup times are
DT_NOEND:

     * This is typically used to calculate a wait timeout for WaitLatch()
     * or a related function.  The choice of "long" as the result type
     * is to harmonize with that.  It is caller's responsibility that the
     * input timestamps not be so far apart as to risk overflow of "long"
     * (which'd happen at about 25 days on machines with 32-bit "long").

Maybe we can adjust that function or create a new one to deal with this.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: suppressing useless wakeups in logical/worker.c

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> I think we might risk overflowing "long" when all the wakeup times are
> DT_NOEND:

>      * This is typically used to calculate a wait timeout for WaitLatch()
>      * or a related function.  The choice of "long" as the result type
>      * is to harmonize with that.  It is caller's responsibility that the
>      * input timestamps not be so far apart as to risk overflow of "long"
>      * (which'd happen at about 25 days on machines with 32-bit "long").

> Maybe we can adjust that function or create a new one to deal with this.

It'd probably be reasonable to file down that sharp edge by instead
specifying that TimestampDifferenceMilliseconds will clamp overflowing
differences to LONG_MAX.  Maybe there should be a clamp on the underflow
side too ... but should it be to LONG_MIN or to zero?

BTW, as long as we're discussing roundoff gotchas, I noticed while
testing your previous patch that there's some inconsistency between
TimestampDifferenceExceeds and TimestampDifferenceMilliseconds.
What you submitted at [1] did this:

+            if (TimestampDifferenceExceeds(last_start, now,
+                                           wal_retrieve_retry_interval))
+                ...
+            else
+            {
+                long        elapsed;
+
+                elapsed = TimestampDifferenceMilliseconds(last_start, now);
+                wait_time = Min(wait_time, wal_retrieve_retry_interval - elapsed);
+            }

and I discovered that that could sometimes busy-wait by repeatedly
falling through to the "else", but then calculating elapsed ==
wal_retrieve_retry_interval and hence setting wait_time to zero.
I fixed it in the committed version [2] by always computing "elapsed"
and then checking if that's strictly less than
wal_retrieve_retry_interval, but I bet there's existing code with the
same issue.  I think we need to take a closer look at making
TimestampDifferenceMilliseconds' roundoff behavior match the outcome of
TimestampDifferenceExceeds comparisons.
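
To spell that out, the committed coding always computes the elapsed time and
then branches on a strict comparison, roughly like this (simplified sketch,
not the exact committed code):

```
long        elapsed;

elapsed = TimestampDifferenceMilliseconds(last_start, now);
if (elapsed >= wal_retrieve_retry_interval)
{
    /* The retry interval has fully elapsed; restart the worker now. */
}
else
{
    /*
     * Not yet.  Since elapsed < wal_retrieve_retry_interval, the remaining
     * wait below is strictly positive, so we can't busy-wait with a zero
     * timeout.
     */
    wait_time = Min(wait_time, wal_retrieve_retry_interval - elapsed);
}
```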

            regards, tom lane

[1] https://www.postgresql.org/message-id/20230110174345.GA1292607%40nathanxps13
[2] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=5a3a95385



Re: suppressing useless wakeups in logical/worker.c

From
Thomas Munro
Date:
On Thu, Jan 26, 2023 at 3:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
> > I think we might risk overflowing "long" when all the wakeup times are
> > DT_NOEND:
>
> >        * This is typically used to calculate a wait timeout for WaitLatch()
> >        * or a related function.  The choice of "long" as the result type
> >        * is to harmonize with that.  It is caller's responsibility that the
> >        * input timestamps not be so far apart as to risk overflow of "long"
> >        * (which'd happen at about 25 days on machines with 32-bit "long").
>
> > Maybe we can adjust that function or create a new one to deal with this.
>
> It'd probably be reasonable to file down that sharp edge by instead
> specifying that TimestampDifferenceMilliseconds will clamp overflowing
> differences to LONG_MAX.  Maybe there should be a clamp on the underflow
> side too ... but should it be to LONG_MIN or to zero?

That got me curious... Why did WaitLatch() use long in the first
place?  I see that it was in Heikki's original sketch[1], but I can't
think of any implementation reason for it.  Note that the current
implementation of WaitLatch() et al will reach WaitEventSetWait()'s
assertion that the timeout is <= INT_MAX, so a LONG_MAX clamp isn't
right without further clamping.  Then internally,
WaitEventSetWaitBlock() takes an int, so there is an implicit cast to
int.  If I had to guess I'd say the reasons for long in the API are
lost, and the WES rewrite used in "int" because that's what poll() and
epoll_wait() wanted.

[1] https://www.postgresql.org/message-id/flat/4C72E85C.3000201%2540enterprisedb.com



Re: suppressing useless wakeups in logical/worker.c

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Thu, Jan 26, 2023 at 3:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It'd probably be reasonable to file down that sharp edge by instead
>> specifying that TimestampDifferenceMilliseconds will clamp overflowing
>> differences to LONG_MAX.  Maybe there should be a clamp on the underflow
>> side too ... but should it be to LONG_MIN or to zero?

> That got me curious... Why did WaitLatch() use long in the first
> place?

Good question.  It's not a great choice, because of the inherent
platform specificity.  OTOH, I'm not sure it's worth the pain
to change now.

            regards, tom lane



Re: suppressing useless wakeups in logical/worker.c

From
Tom Lane
Date:
I wrote:
>>> It'd probably be reasonable to file down that sharp edge by instead
>>> specifying that TimestampDifferenceMilliseconds will clamp overflowing
>>> differences to LONG_MAX.  Maybe there should be a clamp on the underflow
>>> side too ... but should it be to LONG_MIN or to zero?

After looking closer, I see that TimestampDifferenceMilliseconds
already explicitly states that its output is intended for WaitLatch
and friends, which makes it perfectly sane for it to clamp the result
to [0, INT_MAX] rather than depending on the caller to not pass
out-of-range values.

I checked existing callers, and found one place in basebackup_copy.c
that had not read the memo about TimestampDifferenceMilliseconds
never returning a negative value, and another in postmaster.c that
had not read the memo about Min() and Max() being macros.  There
are none that are unhappy about clamping to INT_MAX, and at least
one that was already assuming it could just cast the result to int.

Hence, I propose the attached.  I haven't gone as far as to change
the result type from long to int; that seems like a lot of code
churn (if we are to update WaitLatch and all callers to match)
and it won't really buy anything semantically.

            regards, tom lane

diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 05470057f5..2bb6c89f8c 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -215,7 +215,8 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len)
          * the system clock was set backward, so that such occurrences don't
          * have the effect of suppressing further progress messages.
          */
-        if (ms < 0 || ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD)
+        if (ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD ||
+            now < mysink->last_progress_report_time)
         {
             mysink->last_progress_report_time = now;

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5b775cf7d0..62fba5fcee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1670,11 +1670,12 @@ DetermineSleepTime(void)

     if (next_wakeup != 0)
     {
-        /* Ensure we don't exceed one minute, or go under 0. */
-        return Max(0,
-                   Min(60 * 1000,
-                       TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-                                                       next_wakeup)));
+        int            ms;
+
+        /* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
+        ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
+                                                   next_wakeup);
+        return Min(60 * 1000, ms);
     }

     return 60 * 1000;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e95398db05..b0cfddd548 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -445,7 +445,7 @@ WalReceiverMain(void)
                 pgsocket    wait_fd = PGINVALID_SOCKET;
                 int            rc;
                 TimestampTz nextWakeup;
-                int            nap;
+                long        nap;

                 /*
                  * Exit walreceiver if we're not in recovery. This should not
@@ -528,15 +528,9 @@ WalReceiverMain(void)
                 for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
                     nextWakeup = Min(wakeup[i], nextWakeup);

-                /*
-                 * Calculate the nap time.  WaitLatchOrSocket() doesn't accept
-                 * timeouts longer than INT_MAX milliseconds, so we limit the
-                 * result accordingly.  Also, we round up to the next
-                 * millisecond to avoid waking up too early and spinning until
-                 * one of the wakeup times.
-                 */
+                /* Calculate the nap time, clamping as necessary. */
                 now = GetCurrentTimestamp();
-                nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));
+                nap = TimestampDifferenceMilliseconds(now, nextWakeup);

                 /*
                  * Ideally we would reuse a WaitEventSet object repeatedly
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 928c330897..b1d1963729 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1690,12 +1690,12 @@ TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
  *
  * This is typically used to calculate a wait timeout for WaitLatch()
  * or a related function.  The choice of "long" as the result type
- * is to harmonize with that.  It is caller's responsibility that the
- * input timestamps not be so far apart as to risk overflow of "long"
- * (which'd happen at about 25 days on machines with 32-bit "long").
+ * is to harmonize with that; furthermore, we clamp the result to at most
+ * INT_MAX milliseconds, because that's all that WaitLatch() allows.
  *
- * Both inputs must be ordinary finite timestamps (in current usage,
- * they'll be results from GetCurrentTimestamp()).
+ * At least one input must be an ordinary finite timestamp, else the "diff"
+ * calculation might overflow.  We do support stop_time == TIMESTAMP_INFINITY,
+ * which will result in INT_MAX wait time.
  *
  * We expect start_time <= stop_time.  If not, we return zero,
  * since then we're already past the previously determined stop_time.
@@ -1710,6 +1710,8 @@ TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)

     if (diff <= 0)
         return 0;
+    else if (diff >= (INT_MAX * INT64CONST(1000) - 999))
+        return (long) INT_MAX;
     else
         return (long) ((diff + 999) / 1000);
 }

Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
On Thu, Jan 26, 2023 at 01:54:08PM -0500, Tom Lane wrote:
> After looking closer, I see that TimestampDifferenceMilliseconds
> already explicitly states that its output is intended for WaitLatch
> and friends, which makes it perfectly sane for it to clamp the result
> to [0, INT_MAX] rather than depending on the caller to not pass
> out-of-range values.

+1

>   * This is typically used to calculate a wait timeout for WaitLatch()
>   * or a related function.  The choice of "long" as the result type
> - * is to harmonize with that.  It is caller's responsibility that the
> - * input timestamps not be so far apart as to risk overflow of "long"
> - * (which'd happen at about 25 days on machines with 32-bit "long").
> + * is to harmonize with that; furthermore, we clamp the result to at most
> + * INT_MAX milliseconds, because that's all that WaitLatch() allows.
>   *
> - * Both inputs must be ordinary finite timestamps (in current usage,
> - * they'll be results from GetCurrentTimestamp()).
> + * At least one input must be an ordinary finite timestamp, else the "diff"
> + * calculation might overflow.  We do support stop_time == TIMESTAMP_INFINITY,
> + * which will result in INT_MAX wait time.

I wonder if we should explicitly reject negative timestamps to eliminate
any chance of int64 overflow, too.  Alternatively, we could detect that the
operation will overflow and return either 0 or INT_MAX, but I assume
there's minimal use of this function with negative timestamps.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: suppressing useless wakeups in logical/worker.c

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Thu, Jan 26, 2023 at 01:54:08PM -0500, Tom Lane wrote:
>> - * Both inputs must be ordinary finite timestamps (in current usage,
>> - * they'll be results from GetCurrentTimestamp()).
>> + * At least one input must be an ordinary finite timestamp, else the "diff"
>> + * calculation might overflow.  We do support stop_time == TIMESTAMP_INFINITY,
>> + * which will result in INT_MAX wait time.

> I wonder if we should explicitly reject negative timestamps to eliminate
> any chance of int64 overflow, too.

Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
but I take your point that the subtraction would overflow with
TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
make use of pg_sub_s64_overflow()?

BTW, I just noticed that the adjacent function TimestampDifference
has a lot of callers that would be much happier using
TimestampDifferenceMilliseconds.
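
For instance, a caller doing the common dance (illustrative only):

```
long        secs;
int         microsecs;
long        timeout_ms;

TimestampDifference(start_time, stop_time, &secs, &microsecs);
timeout_ms = secs * 1000 + microsecs / 1000;
```

could instead just do (and get the round-up-to-the-next-millisecond behavior
for free):

```
timeout_ms = TimestampDifferenceMilliseconds(start_time, stop_time);
```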

            regards, tom lane



Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
On Thu, Jan 26, 2023 at 03:04:30PM -0500, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I wonder if we should explicitly reject negative timestamps to eliminate
>> any chance of int64 overflow, too.
> 
> Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
> but I take your point that the subtraction would overflow with
> TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
> make use of pg_sub_s64_overflow()?

That would be my vote.  I think the 'diff <= 0' check might need to be
replaced with something like 'start_time > stop_time' so that we return 0
for the underflow case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: suppressing useless wakeups in logical/worker.c

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Thu, Jan 26, 2023 at 03:04:30PM -0500, Tom Lane wrote:
>> Hmm.  I'm disinclined to add an assumption that the epoch is in the past,
>> but I take your point that the subtraction would overflow with
>> TIMESTAMP_INFINITY and a negative finite timestamp.  Maybe we should
>> make use of pg_sub_s64_overflow()?

> That would be my vote.  I think the 'diff <= 0' check might need to be
> replaced with something like 'start_time > stop_time' so that we return 0
> for the underflow case.

Right, so more like this.

            regards, tom lane

diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 05470057f5..2bb6c89f8c 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -215,7 +215,8 @@ bbsink_copystream_archive_contents(bbsink *sink, size_t len)
          * the system clock was set backward, so that such occurrences don't
          * have the effect of suppressing further progress messages.
          */
-        if (ms < 0 || ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD)
+        if (ms >= PROGRESS_REPORT_MILLISECOND_THRESHOLD ||
+            now < mysink->last_progress_report_time)
         {
             mysink->last_progress_report_time = now;

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5b775cf7d0..62fba5fcee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1670,11 +1670,12 @@ DetermineSleepTime(void)

     if (next_wakeup != 0)
     {
-        /* Ensure we don't exceed one minute, or go under 0. */
-        return Max(0,
-                   Min(60 * 1000,
-                       TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
-                                                       next_wakeup)));
+        int            ms;
+
+        /* result of TimestampDifferenceMilliseconds is in [0, INT_MAX] */
+        ms = (int) TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
+                                                   next_wakeup);
+        return Min(60 * 1000, ms);
     }

     return 60 * 1000;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index e95398db05..b0cfddd548 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -445,7 +445,7 @@ WalReceiverMain(void)
                 pgsocket    wait_fd = PGINVALID_SOCKET;
                 int            rc;
                 TimestampTz nextWakeup;
-                int            nap;
+                long        nap;

                 /*
                  * Exit walreceiver if we're not in recovery. This should not
@@ -528,15 +528,9 @@ WalReceiverMain(void)
                 for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
                     nextWakeup = Min(wakeup[i], nextWakeup);

-                /*
-                 * Calculate the nap time.  WaitLatchOrSocket() doesn't accept
-                 * timeouts longer than INT_MAX milliseconds, so we limit the
-                 * result accordingly.  Also, we round up to the next
-                 * millisecond to avoid waking up too early and spinning until
-                 * one of the wakeup times.
-                 */
+                /* Calculate the nap time, clamping as necessary. */
                 now = GetCurrentTimestamp();
-                nap = (int) Min(INT_MAX, Max(0, (nextWakeup - now + 999) / 1000));
+                nap = TimestampDifferenceMilliseconds(now, nextWakeup);

                 /*
                  * Ideally we would reuse a WaitEventSet object repeatedly
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 928c330897..47e059a409 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1690,26 +1690,31 @@ TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
  *
  * This is typically used to calculate a wait timeout for WaitLatch()
  * or a related function.  The choice of "long" as the result type
- * is to harmonize with that.  It is caller's responsibility that the
- * input timestamps not be so far apart as to risk overflow of "long"
- * (which'd happen at about 25 days on machines with 32-bit "long").
- *
- * Both inputs must be ordinary finite timestamps (in current usage,
- * they'll be results from GetCurrentTimestamp()).
+ * is to harmonize with that; furthermore, we clamp the result to at most
+ * INT_MAX milliseconds, because that's all that WaitLatch() allows.
  *
  * We expect start_time <= stop_time.  If not, we return zero,
  * since then we're already past the previously determined stop_time.
  *
+ * Subtracting finite and infinite timestamps works correctly, returning
+ * zero or INT_MAX as appropriate.
+ *
  * Note we round up any fractional millisecond, since waiting for just
  * less than the intended timeout is undesirable.
  */
 long
 TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
 {
-    TimestampTz diff = stop_time - start_time;
+    TimestampTz diff;

-    if (diff <= 0)
+    /* Deal with zero or negative elapsed time quickly. */
+    if (start_time >= stop_time)
         return 0;
+    /* To not fail with timestamp infinities, we must detect overflow. */
+    if (pg_sub_s64_overflow(stop_time, start_time, &diff))
+        return (long) INT_MAX;
+    if (diff >= (INT_MAX * INT64CONST(1000) - 999))
+        return (long) INT_MAX;
     else
         return (long) ((diff + 999) / 1000);
 }

Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
On Thu, Jan 26, 2023 at 04:09:51PM -0500, Tom Lane wrote:
> Right, so more like this.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: suppressing useless wakeups in logical/worker.c

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Thu, Jan 26, 2023 at 04:09:51PM -0500, Tom Lane wrote:
>> Right, so more like this.

> LGTM

Thanks, pushed.

Returning to the prior patch ... I don't much care for this:

+                    /* Maybe there will be a free slot in a second... */
+                    retry_time = TimestampTzPlusSeconds(now, 1);
+                    LogRepWorkerUpdateSyncStartWakeup(retry_time);

We're not moving the goalposts very far on unnecessary wakeups if
we have to do that.  Do we need to get a wakeup on sync slot free?
Although having to send that to every worker seems ugly.  Maybe this
is being done in the wrong place and we need to find a way to get
the launcher to handle it.

As for the business about process_syncing_tables being only called
conditionally, I was already of the opinion that the way it's
getting called is loony.  Why isn't it called from LogicalRepApplyLoop
(and noplace else)?  With more certainty about when it runs, we might
not need so many kluges elsewhere.

            regards, tom lane



Re: suppressing useless wakeups in logical/worker.c

From
Amit Kapila
Date:
On Fri, Jan 27, 2023 at 4:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thanks, pushed.
>
> Returning to the prior patch ... I don't much care for this:
>
> +                    /* Maybe there will be a free slot in a second... */
> +                    retry_time = TimestampTzPlusSeconds(now, 1);
> +                    LogRepWorkerUpdateSyncStartWakeup(retry_time);
>
> We're not moving the goalposts very far on unnecessary wakeups if
> we have to do that.  Do we need to get a wakeup on sync slot free?
> Although having to send that to every worker seems ugly.  Maybe this
> is being done in the wrong place and we need to find a way to get
> the launcher to handle it.
>
> As for the business about process_syncing_tables being only called
> conditionally, I was already of the opinion that the way it's
> getting called is loony.  Why isn't it called from LogicalRepApplyLoop
> (and noplace else)?

Currently, it seems to be called after processing transaction end
commands or when we are not in any transaction.  As per my
understanding, that is when we can ensure the sync between the tablesync
and apply workers.  For example, say the tablesync worker is doing the
initial copy while the apply worker goes ahead and processes some
additional xacts (WAL); the tablesync worker then needs to process all
those transactions after the initial sync before it can mark the state
as SYNCDONE.  So that can be checked only at transaction boundaries.
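
For reference, the existing call sites are roughly of this shape (paraphrased
from worker.c, not verbatim):

```
/* After a transaction end command, e.g. at the end of apply_handle_commit(): */
process_syncing_tables(commit_data.end_lsn);

/* And in LogicalRepApplyLoop(), but only when not inside a transaction: */
if (!in_remote_transaction && !in_streamed_transaction)
{
    /* ... */
    process_syncing_tables(last_received);
}
```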

However, it is not very clear to me why the patch needs the below code.
@@ -3615,7 +3639,33 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
  if (!dlist_is_empty(&lsn_mapping))
  wait_time = WalWriterDelay;
  else
- wait_time = NAPTIME_PER_CYCLE;
+ {
+ TimestampTz nextWakeup = DT_NOEND;
+
+ /*
+ * Since process_syncing_tables() is called conditionally, the
+ * tablesync worker start wakeup time might be in the past, and we
+ * can't know for sure when it will be updated again.  Rather than
+ * spinning in a tight loop in this case, bump this wakeup time by
+ * a second.
+ */
+ now = GetCurrentTimestamp();
+ if (wakeup[LRW_WAKEUP_SYNC_START] < now)
+ wakeup[LRW_WAKEUP_SYNC_START] =
+     TimestampTzPlusSeconds(wakeup[LRW_WAKEUP_SYNC_START], 1);

Do we see unnecessary wakeups without this, or delay in sync?

BTW, do we need to do something about wakeups in
wait_for_relation_state_change()?

-- 
With Regards,
Amit Kapila.



Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
On Sat, Jan 28, 2023 at 10:26:25AM +0530, Amit Kapila wrote:
> On Fri, Jan 27, 2023 at 4:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Returning to the prior patch ... I don't much care for this:
>>
>> +                    /* Maybe there will be a free slot in a second... */
>> +                    retry_time = TimestampTzPlusSeconds(now, 1);
>> +                    LogRepWorkerUpdateSyncStartWakeup(retry_time);
>>
>> We're not moving the goalposts very far on unnecessary wakeups if
>> we have to do that.  Do we need to get a wakeup on sync slot free?
>> Although having to send that to every worker seems ugly.  Maybe this
>> is being done in the wrong place and we need to find a way to get
>> the launcher to handle it.

It might be feasible to set up a before_shmem_exit() callback that wakes up
the apply worker (as is already done for the launcher).  I think the
apply worker is ordinarily notified via the tablesync worker's notify_pid,
but AFAICT there's no guarantee that the apply worker hasn't restarted with
a different PID.
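
A minimal sketch of that idea (hypothetical only, not patch code) could
register a callback in the tablesync worker that pokes whichever apply worker
currently serves the subscription:

```
/* Hypothetical sketch; names are illustrative. */
static void
wakeup_apply_worker_at_exit(int code, Datum arg)
{
    /*
     * Wake the apply worker for our subscription, whatever its PID is by
     * now, rather than relying on the stored notify_pid.
     */
    logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
}

/* ... during tablesync worker startup ... */
before_shmem_exit(wakeup_apply_worker_at_exit, (Datum) 0);
```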

> + /*
> + * Since process_syncing_tables() is called conditionally, the
> + * tablesync worker start wakeup time might be in the past, and we
> + * can't know for sure when it will be updated again.  Rather than
> + * spinning in a tight loop in this case, bump this wakeup time by
> + * a second.
> + */
> + now = GetCurrentTimestamp();
> + if (wakeup[LRW_WAKEUP_SYNC_START] < now)
> + wakeup[LRW_WAKEUP_SYNC_START] =
> TimestampTzPlusSeconds(wakeup[LRW_WAKEUP_SYNC_START], 1);
> 
> Do we see unnecessary wakeups without this, or delay in sync?

I haven't looked too closely to see whether busy loops are likely in
practice.

> BTW, do we need to do something about wakeups in
> wait_for_relation_state_change()?

... and wait_for_worker_state_change(), and copy_read_data().  From a quick
glance, it looks like fixing these would be a more invasive change.  TBH
I'm beginning to wonder whether all this is really worth it to prevent
waking up once per second.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: suppressing useless wakeups in logical/worker.c

From
Amit Kapila
Date:
On Wed, Feb 1, 2023 at 5:35 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Sat, Jan 28, 2023 at 10:26:25AM +0530, Amit Kapila wrote:
> > On Fri, Jan 27, 2023 at 4:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Returning to the prior patch ... I don't much care for this:
> >>
> >> +                    /* Maybe there will be a free slot in a second... */
> >> +                    retry_time = TimestampTzPlusSeconds(now, 1);
> >> +                    LogRepWorkerUpdateSyncStartWakeup(retry_time);
> >>
> >> We're not moving the goalposts very far on unnecessary wakeups if
> >> we have to do that.  Do we need to get a wakeup on sync slot free?
> >> Although having to send that to every worker seems ugly.  Maybe this
> >> is being done in the wrong place and we need to find a way to get
> >> the launcher to handle it.
>
> It might be feasible to set up a before_shmem_exit() callback that wakes up
> the apply worker (like is already done for the launcher).  I think the
> apply worker is ordinarily notified via the tablesync worker's notify_pid,
> but AFAICT there's no guarantee that the apply worker hasn't restarted with
> a different PID.
>
> > + /*
> > + * Since process_syncing_tables() is called conditionally, the
> > + * tablesync worker start wakeup time might be in the past, and we
> > + * can't know for sure when it will be updated again.  Rather than
> > + * spinning in a tight loop in this case, bump this wakeup time by
> > + * a second.
> > + */
> > + now = GetCurrentTimestamp();
> > + if (wakeup[LRW_WAKEUP_SYNC_START] < now)
> > + wakeup[LRW_WAKEUP_SYNC_START] =
> +     TimestampTzPlusSeconds(wakeup[LRW_WAKEUP_SYNC_START], 1);
> >
> > Do we see unnecessary wakeups without this, or delay in sync?
>
> I haven't looked too closely to see whether busy loops are likely in
> practice.
>
> > BTW, do we need to do something about wakeups in
> > wait_for_relation_state_change()?
>
> ... and wait_for_worker_state_change(), and copy_read_data().  From a quick
> glance, it looks like fixing these would be a more invasive change.
>

What kind of logic do you have in mind to avoid waking up once per
second in those cases?

>  TBH
> I'm beginning to wonder whether all this is really worth it to prevent
> waking up once per second.
>

If we can't do it for all cases, do you see any harm in doing it for
cases where we can achieve it without adding much complexity? We can
probably add comments for others so that if someone else has better
ideas in the future we can deal with those as well.

--
With Regards,
Amit Kapila.



Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
I've attached a minimally-updated patch that doesn't yet address the bigger
topics under discussion.

On Thu, Mar 16, 2023 at 03:30:37PM +0530, Amit Kapila wrote:
> On Wed, Feb 1, 2023 at 5:35 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Sat, Jan 28, 2023 at 10:26:25AM +0530, Amit Kapila wrote:
>> > BTW, do we need to do something about wakeups in
>> > wait_for_relation_state_change()?
>>
>> ... and wait_for_worker_state_change(), and copy_read_data().  From a quick
>> glance, it looks like fixing these would be a more invasive change.
> 
> What kind of logic do you have in mind to avoid waking up once per
> second in those cases?

I haven't looked into this too much yet.  I'd probably try out Tom's
suggestions from upthread [0] next and see if those can be applied here,
too.

>>  TBH
>> I'm beginning to wonder whether all this is really worth it to prevent
>> waking up once per second.
> 
> If we can't do it for all cases, do you see any harm in doing it for
> cases where we can achieve it without adding much complexity? We can
> probably add comments for others so that if someone else has better
> ideas in the future we can deal with those as well.

I don't think there's any harm, but I'm also not sure it does a whole lot
of good.  At the very least, I think we should figure out something better
than the process_syncing_tables() hacks before taking this patch seriously.

[0] https://postgr.es/m/3220831.1674772625%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: suppressing useless wakeups in logical/worker.c

From
Amit Kapila
Date:
On Fri, Mar 17, 2023 at 5:52 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> I've attached a minimally-updated patch that doesn't yet address the bigger
> topics under discussion.
>
> On Thu, Mar 16, 2023 at 03:30:37PM +0530, Amit Kapila wrote:
> > On Wed, Feb 1, 2023 at 5:35 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> On Sat, Jan 28, 2023 at 10:26:25AM +0530, Amit Kapila wrote:
> >> > BTW, do we need to do something about wakeups in
> >> > wait_for_relation_state_change()?
> >>
> >> ... and wait_for_worker_state_change(), and copy_read_data().  From a quick
> >> glance, it looks like fixing these would be a more invasive change.
> >
> > What kind of logic do you have in mind to avoid waking up once per
> > second in those cases?
>
> I haven't looked into this too much yet.  I'd probably try out Tom's
> suggestions from upthread [0] next and see if those can be applied here,
> too.
>

For the clean exit of a tablesync worker, we already wake up the apply
worker in finish_sync_worker().  You probably want to deal with error
cases, or is there something else on your mind?  BTW, for
wait_for_worker_state_change(), one possibility is to wake all the
sync workers when the apply worker exits, but I am not sure if that is a
very good idea.

Few minor comments:
=====================
1.
- if (wal_receiver_timeout > 0)
+ now = GetCurrentTimestamp();
+ if (now >= wakeup[LRW_WAKEUP_TERMINATE])
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("terminating logical replication worker due to timeout")));
+
+ /* Check to see if it's time for a ping. */
+ if (now >= wakeup[LRW_WAKEUP_PING])
  {
- TimestampTz now = GetCurrentTimestamp();

Previously, we used to call GetCurrentTimestamp() only when
wal_receiver_timeout > 0, but we ignore that condition now.  It may not
matter much, but if possible let's avoid calling GetCurrentTimestamp()
in additional places.

2.
+ for (int i = 0; i < NUM_LRW_WAKEUPS; i++)
+ LogRepWorkerComputeNextWakeup(i, now);
+
+ /*
+ * LogRepWorkerComputeNextWakeup() will have cleared the tablesync
+ * worker start wakeup time, so we might not wake up to start a new
+ * worker at the appropriate time.  To deal with this, we set the
+ * wakeup time to right now so that
+ * process_syncing_tables_for_apply() recalculates it as soon as
+ * possible.
+ */
+ if (!am_tablesync_worker())
+ LogRepWorkerUpdateSyncStartWakeup(now);

Can't we avoid clearing the syncstart time in the first place?

--
With Regards,
Amit Kapila.



Re: suppressing useless wakeups in logical/worker.c

From
Daniel Gustafsson
Date:
> On 17 Mar 2023, at 10:16, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Few minor comments:

Have you had a chance to address the comments raised by Amit in this thread?

--
Daniel Gustafsson




Re: suppressing useless wakeups in logical/worker.c

From
Nathan Bossart
Date:
On Tue, Jul 04, 2023 at 09:48:23AM +0200, Daniel Gustafsson wrote:
>> On 17 Mar 2023, at 10:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
>> Few minor comments:
> 
> Have you had a chance to address the comments raised by Amit in this thread?

Not yet, sorry.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com