Re: Changes to recovery_min_apply_delay are ignored while waiting for delay - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Changes to recovery_min_apply_delay are ignored while waiting for delay
Date
Msg-id YRN+9+HrQ1eVIS/5@paquier.xyz
Whole thread Raw
In response to Re: Changes to recovery_min_apply_delay are ignored while waiting for delay  (Soumyadeep Chakraborty <soumyadeep2007@gmail.com>)
Responses Re: Changes to recovery_min_apply_delay are ignored while waiting for delay
List pgsql-hackers
On Fri, Aug 06, 2021 at 04:59:55PM -0700, Soumyadeep Chakraborty wrote:
> Rebased. Also added a stronger check to see if the standby is stuck in
> recovery_min_apply_delay:
>
> $node_standby->poll_query_until('postgres', qq{
> SELECT wait_event = 'RecoveryApplyDelay' FROM pg_stat_activity
> WHERE backend_type='startup';
> }) or die "Timed out checking if startup is in recovery_min_apply_delay";

Relying on wait events is a good idea here.  This makes the test more
reliable to make sure that the startup process is stuck in the
WaitLatch phase.

> Attached v3.

Agreed that the current behavior is confusing.  As you are using the
commit timestamp for the comparison, this is right.  One small-ish
comment I have about the code is that we should mention
recovery_min_apply_delay for HandleStartupProcInterrupts(), and not
only the trigger file location.

+# First, set the delay for the next commit to some obscenely high value.
It has no need to be obscenely high, just high enough to give the time
to slow machines to catch the wait event lookup done.  So this could
use better words to explain this choice.

Anyway, it seems to me that something is incorrect in this new test
(manual tests pass here, the TAP test is off).  One thing we need to
make sure for any new test added is that it correctly breaks if the
fix proposed is *not* in place.  And as far as I can see, the test
passes even if the recalculation of delayUntil is done within the loop
that reloads the configuration.  The thing may be a bit tricky to make
reliable as the parameter reloading may cause wait_event to not be
RecoveryApplyDelay, so we should have at least a check based on a scan
of pg_stat_replication.replay_lsn on the primary.  Perhaps you have
an other idea?

When using wait_for_catchup(), I would recommend to list "replay" for
this test, even if that's the default mode, to make clear what the
intention is.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Unresolved repliaction hang and stop problem.
Next
From: Shruthi Gowda
Date:
Subject: Re: storing an explicit nonce