Thread: Switching XLog source from archive to streaming when primary available

Switching XLog source from archive to streaming when primary available

From
SATYANARAYANA NARLAPURAM
Date:
Hi Hackers,

When the standby can't connect to the primary, it switches the XLog source from streaming to archive and stays in that state as long as it can get WAL from the archive location. On a server with high WAL activity, fetching WAL from the archive is typically slower than streaming it from the primary, so the standby may never exit that state. This not only increases the lag on the standby but also adversely impacts the primary, as WAL accumulates and vacuum cannot remove dead tuples. As a mitigation, DBAs can remove or advance the slot, or remove the restore_command on the standby, but that is manual work I am trying to avoid. I would like to propose the following; please let me know your thoughts.
  • Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
  • when  retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
  • When the retry attempt fails, then switch back to the archive
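
For illustration, the proposed trigger might look roughly like this. This is a hypothetical sketch, not patch code: the GUC name and the -1-disables convention come from the proposal above, while the counter and helper function are invented here.

```c
/* Hypothetical sketch of the proposed segment-count trigger.  The GUC
 * name comes from the proposal; the counter and helper are invented
 * for illustration.  -1 disables the feature, as proposed. */
#include <stdbool.h>

static int retry_primary_conn_after_wal_segments = 4;
static int segments_replayed = 0;

/* Called after each WAL segment replayed from the archive; returns true
 * when the standby should attempt to switch back to streaming. */
static bool
should_retry_streaming(bool primary_conninfo_set)
{
    if (retry_primary_conn_after_wal_segments < 0)
        return false;               /* feature disabled */
    if (!primary_conninfo_set)
        return false;               /* no primary to stream from */
    if (++segments_replayed >= retry_primary_conn_after_wal_segments)
    {
        segments_replayed = 0;      /* restart the count for the next cycle */
        return true;
    }
    return false;
}
```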
Thanks,
Satya

Re: Switching XLog source from archive to streaming when primary available

From
Dilip Kumar
Date:
On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
> Hi Hackers,
>
> When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and continues
> in that state until it can get the WAL from the archive location. On a server with high WAL activity, typically getting
> the WAL from the archive is slower than streaming it from the primary and couldn't exit from that state. This not only
> increases the lag on the standby but also adversely impacts the primary as the WAL gets accumulated, and vacuum is not
> able to collect the dead tuples. DBAs as a mitigation can however remove/advance the slot or remove the restore_command
> on the standby but this is a manual work I am trying to avoid. I would like to propose the following, please let me
> know your thoughts.
>
> Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying
> 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
> when  retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
> When the retry attempt fails, then switch back to the archive

I think there is another thread [1] that is logically trying to solve
a similar issue: basically, in the main recovery apply loop, if the
walreceiver does not exist then it launches the walreceiver.
However, that patch does not change the current Xlog source, which I
think is not a good idea because with that it will restore from
the archive as well as stream from the primary, so I have given that
review comment on that thread as well.  One big difference is that
that patch launches the walreceiver even if the WAL is locally
available and we don't really need more WAL, though that is controlled
by a GUC.

[1] https://www.postgresql.org/message-id/CAKYtNApe05WmeRo92gTePEmhOM4myMpCK_%2BceSJtC7-AWLw1qw%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
> Hi Hackers,
>
> When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and continues
> in that state until it can get the WAL from the archive location. On a server with high WAL activity, typically getting
> the WAL from the archive is slower than streaming it from the primary and couldn't exit from that state. This not only
> increases the lag on the standby but also adversely impacts the primary as the WAL gets accumulated, and vacuum is not
> able to collect the dead tuples. DBAs as a mitigation can however remove/advance the slot or remove the restore_command
> on the standby but this is a manual work I am trying to avoid. I would like to propose the following, please let me
> know your thoughts.
>
> Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after replaying
> 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
> when  retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
> When the retry attempt fails, then switch back to the archive

I've gone through the state machine in WaitForWALToBecomeAvailable and
I understand it this way: a failure to receive WAL records from the
primary causes the current source to switch to archive, and the standby
continues to get WAL records from the archive location; unless some
failure occurs there, the current source is never going to switch back
to stream. Given that getting WAL from the archive location causes
delays in production environments, we miss taking advantage of a
reconnection to the primary after the previous failed attempt.

So basically, we attempt to switch to streaming from archive
(even though fetching from archive can succeed) after a certain amount
of time or number of WAL segments. I prefer a timing-based switch to
streaming from archive over one after a number of WAL segments fetched
from archive. Right now, wal_retrieve_retry_interval is used to wait
before switching to archive after a failed attempt at streaming; IMO,
a similar GUC (that gets set once the source switches from streaming
to archive, and on timeout switches back to streaming) can be used
to switch from archive to streaming after the specified amount of
time.

Thoughts?
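
The timing-based idea reduces to a simple predicate. This is a hypothetical illustration only: a plain millisecond value stands in for PostgreSQL's TimestampTz machinery, and the GUC name is an assumption.

```c
/* Hypothetical sketch of a time-based retry gate.  A plain millisecond
 * value stands in for TimestampTz; the GUC name is an assumption. */
#include <stdbool.h>

static int wal_source_switch_interval = 5000;   /* ms; 0 disables */
static long last_switch_time = 0;               /* ms since some epoch */

/* Return true when enough time has passed in archive mode to justify
 * another attempt to stream from the primary. */
static bool
time_to_try_streaming(long now)
{
    if (wal_source_switch_interval == 0)
        return false;                   /* feature disabled */
    if (now - last_switch_time >= wal_source_switch_interval)
    {
        last_switch_time = now;         /* remember this attempt */
        return true;
    }
    return false;
}
```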

Regards,
Bharath Rupireddy.



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Sat, Apr 30, 2022 at 6:19 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Nov 29, 2021 at 1:30 AM SATYANARAYANA NARLAPURAM
> <satyanarlapuram@gmail.com> wrote:
> >
> > Hi Hackers,
> >
> > When the standby couldn't connect to the primary it switches the XLog source from streaming to archive and
> > continues in that state until it can get the WAL from the archive location. On a server with high WAL activity,
> > typically getting the WAL from the archive is slower than streaming it from the primary and couldn't exit from that
> > state. This not only increases the lag on the standby but also adversely impacts the primary as the WAL gets
> > accumulated, and vacuum is not able to collect the dead tuples. DBAs as a mitigation can however remove/advance the
> > slot or remove the restore_command on the standby but this is a manual work I am trying to avoid. I would like to
> > propose the following, please let me know your thoughts.
> >
> > Automatically attempt to switch the source from Archive to streaming when the primary_conninfo is set after
> > replaying 'N' wal segment governed by the GUC retry_primary_conn_after_wal_segments
> > when  retry_primary_conn_after_wal_segments is set to -1 then the feature is disabled
> > When the retry attempt fails, then switch back to the archive
>
> I've gone through the state machine in WaitForWALToBecomeAvailable and
> I understand it this way: failed to receive WAL records from the
> primary causes the current source to switch to archive and the standby
> continues to get WAL records from archive location unless some failure
> occurs there the current source is never going to switch back to
> stream. Given the fact that getting WAL from archive location causes
> delay in production environments, we miss to take the advantage of the
> reconnection to primary after previous failed attempt.
>
> So basically, we try to attempt to switch to streaming from archive
> (even though fetching from archive can succeed) after a certain amount
> of time or WAL segments. I prefer timing-based switch to streaming
> from archive instead of after a number of WAL segments fetched from
> archive. Right now, wal_retrieve_retry_interval is being used to wait
> before switching to archive after failed attempt from streaming, IMO,
> a similar GUC (that gets set once the source switched from streaming
> to archive and on timeout it switches to streaming again) can be used
> to switch from archive to streaming after the specified amount of
> time.
>
> Thoughts?

Here's a v1 patch that I've come up with. I'm right now using the
existing GUC wal_retrieve_retry_interval to switch to stream mode from
archive mode as opposed to switching only after the failure to get WAL
from archive mode. If okay with the approach, I can add tests, change
the docs and add a new GUC to control this behaviour. I'm open to
thoughts and ideas here.

Regards,
Bharath Rupireddy.

Attachment
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Hello

I tested this patch in a setup where the standby is in the middle of replicating and REDOing the primary's WAL files
during a very large data insertion. During this time, I keep killing the walreceiver process to cause a stream failure
and force the standby to read from archive. The system will restore from archive for "wal_retrieve_retry_interval"
seconds before it attempts to stream again. Without this patch, once the streaming is interrupted, it keeps reading
from archive until the standby reaches the same consistent state as the primary and then switches back to streaming
again. So it seems that the patch does the job as described and does bring some benefit during a very large REDO job,
where it will try to re-stream after restoring some WALs from archive to speed up this "catch up" process. But if the
recovery job is not a large one, PG already switches back to streaming once it hits a consistent state.

thank you

Cary Huang
HighGo Software Canada

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Sat, Jun 25, 2022 at 1:31 AM Cary Huang <cary.huang@highgo.ca> wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            not tested
>
> Hello
>
> I tested this patch in a setup where the standby is in the middle of replicating and REDOing the primary's WAL files
> during a very large data insertion. During this time, I keep killing the walreceiver process to cause a stream failure
> and force the standby to read from archive. The system will restore from archive for "wal_retrieve_retry_interval"
> seconds before it attempts to stream again. Without this patch, once the streaming is interrupted, it keeps reading
> from archive until the standby reaches the same consistent state as the primary and then switches back to streaming
> again. So it seems that the patch does the job as described and does bring some benefit during a very large REDO job,
> where it will try to re-stream after restoring some WALs from archive to speed up this "catch up" process. But if the
> recovery job is not a large one, PG already switches back to streaming once it hits a consistent state.

Thanks a lot Cary for testing the patch.

> Here's a v1 patch that I've come up with. I'm right now using the
> existing GUC wal_retrieve_retry_interval to switch to stream mode from
> archive mode as opposed to switching only after the failure to get WAL
> from archive mode. If okay with the approach, I can add tests, change
> the docs and add a new GUC to control this behaviour. I'm open to
> thoughts and ideas here.

It will be great if I can hear some thoughts on the above points (as
posted upthread).

Regards,
Bharath Rupireddy.



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Fri, Jul 8, 2022 at 9:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Sat, Jun 25, 2022 at 1:31 AM Cary Huang <cary.huang@highgo.ca> wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:       tested, passed
> > Spec compliant:           not tested
> > Documentation:            not tested
> >
> > Hello
> >
> > I tested this patch in a setup where the standby is in the middle of replicating and REDOing the primary's WAL
> > files during a very large data insertion. During this time, I keep killing the walreceiver process to cause a
> > stream failure and force the standby to read from archive. The system will restore from archive for
> > "wal_retrieve_retry_interval" seconds before it attempts to stream again. Without this patch, once the streaming is
> > interrupted, it keeps reading from archive until the standby reaches the same consistent state as the primary and
> > then switches back to streaming again. So it seems that the patch does the job as described and does bring some
> > benefit during a very large REDO job, where it will try to re-stream after restoring some WALs from archive to
> > speed up this "catch up" process. But if the recovery job is not a large one, PG already switches back to streaming
> > once it hits a consistent state.
>
> Thanks a lot Cary for testing the patch.
>
> > Here's a v1 patch that I've come up with. I'm right now using the
> > existing GUC wal_retrieve_retry_interval to switch to stream mode from
> > archive mode as opposed to switching only after the failure to get WAL
> > from archive mode. If okay with the approach, I can add tests, change
> > the docs and add a new GUC to control this behaviour. I'm open to
> > thoughts and ideas here.
>
> It will be great if I can hear some thoughts on the above points (as
> posted upthread).

Here's the v2 patch with a separate GUC; a new GUC was necessary as the
existing GUC wal_retrieve_retry_interval is loaded with multiple
usages. When the feature is enabled, it lets the standby switch to
stream mode, i.e. fetch WAL from the primary, even before fetching from
the archive fails. The switch from archive to stream mode happens in 2
scenarios: 1) when the standby is in initial recovery, 2) when there was
a failure in receiving from the primary (walreceiver got killed or
crashed or timed out, or connectivity to the primary was broken, for
whatever reason).

I also added test cases to the v2 patch.

Please review the patch.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
+      <indexterm>
+       <primary><varname>wal_source_switch_interval</varname> configuration parameter</primary>
+      </indexterm>

I don't want to bikeshed on the name too much, but I do think we need
something more descriptive.  I'm thinking of something like
streaming_replication_attempt_interval or
streaming_replication_retry_interval.

+        Specifies how long the standby server should wait before switching WAL
+        source from WAL archive to primary (streaming replication). This can
+        happen either during the standby initial recovery or after a previous
+        failed attempt to stream WAL from the primary.

I'm not sure what the second sentence means.  In general, I think the
explanation in your commit message is much clearer:

    The standby makes an attempt to read WAL from primary after
    wal_retrieve_retry_interval milliseconds reading from archive.

+        If this value is specified without units, it is taken as milliseconds.
+        The default value is 5 seconds. A setting of <literal>0</literal>
+        disables the feature.

5 seconds seems low.  I would expect the default to be 1-5 minutes.  I
think it's important to strike a balance between interrupting archive
recovery to attempt streaming replication and letting archive recovery make
progress.

+     * Try reading WAL from primary after every wal_source_switch_interval
+     * milliseconds, when state machine is in XLOG_FROM_ARCHIVE state. If
+     * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
+     * it falls back to XLOG_FROM_ARCHIVE state.

It's not clear to me how this is expected to interact with the pg_wal phase
of standby recovery.  As the docs note [0], standby servers loop through
archive recovery, recovery from pg_wal, and streaming replication.  Does
this cause the pg_wal phase to be skipped (i.e., the standby goes straight
from archive recovery to streaming replication)?  I wonder if it'd be
better for this mechanism to simply move the standby to the pg_wal phase so
that the usual ordering isn't changed.

+                    if (!first_time &&
+                        TimestampDifferenceExceeds(last_switch_time, curr_time,
+                                                   wal_source_switch_interval))

Shouldn't this also check that wal_source_switch_interval is not set to 0?

+                        elog(DEBUG2,
+                             "trying to switch WAL source to %s after fetching WAL from %s for %d milliseconds",
+                             xlogSourceNames[XLOG_FROM_STREAM],
+                             xlogSourceNames[currentSource],
+                             wal_source_switch_interval);
+
+                        last_switch_time = curr_time;

Shouldn't the last_switch_time be set when the state machine first enters
XLOG_FROM_ARCHIVE?  IIUC this logic is currently counting time spent
elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
source switch.  This would mean that a standby that has spent a lot of time
in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
XLOG_FROM_ARCHIVE when it failed again.  Given the standby will wait for
wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
seems like we could end up rapidly looping between sources.  Perhaps I am
misunderstanding how this is meant to work.

+    {
+        {"wal_source_switch_interval", PGC_SIGHUP, REPLICATION_STANDBY,
+            gettext_noop("Sets the time to wait before switching WAL source from archive to primary"),
+            gettext_noop("0 turns this feature off."),
+            GUC_UNIT_MS
+        },
+        &wal_source_switch_interval,
+        5000, 0, INT_MAX,
+        NULL, NULL, NULL
+    },

I wonder if the lower bound should be higher to avoid switching
unnecessarily rapidly between WAL sources.  I see that
WaitForWALToBecomeAvailable() ensures that standbys do not switch from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
wal_retrieve_retry_interval.  Perhaps wal_retrieve_retry_interval should be
the lower bound for this GUC, too.  Or maybe WaitForWALToBecomeAvailable()
should make sure that the standby makes at least once attempt to restore
the file from archive before switching to streaming replication.

[0] https://www.postgresql.org/docs/current/warm-standby.html#STANDBY-SERVER-OPERATION

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Wed, Sep 7, 2022 at 3:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> +      <indexterm>
> +       <primary><varname>wal_source_switch_interval</varname> configuration parameter</primary>
> +      </indexterm>
>
> I don't want to bikeshed on the name too much, but I do think we need
> something more descriptive.  I'm thinking of something like
> streaming_replication_attempt_interval or
> streaming_replication_retry_interval.

I came up with wal_source_switch_interval after a lot of
bikeshedding myself :). However, streaming_replication_retry_interval
looks much better; I've used it in the latest patch. Thanks.

> +        Specifies how long the standby server should wait before switching WAL
> +        source from WAL archive to primary (streaming replication). This can
> +        happen either during the standby initial recovery or after a previous
> +        failed attempt to stream WAL from the primary.
>
> I'm not sure what the second sentence means.  In general, I think the
> explanation in your commit message is much clearer:

I polished the comments, docs and commit message a bit, please check now.

> 5 seconds seems low.  I would expect the default to be 1-5 minutes.  I
> think it's important to strike a balance between interrupting archive
> recovery to attempt streaming replication and letting archive recovery make
> progress.

+1 for a default value of 5 minutes to avoid frequent interruptions
for archive mode when primary is really down for a long time. I've
also added a cautionary note in the docs about the lower values.

> +        * Try reading WAL from primary after every wal_source_switch_interval
> +        * milliseconds, when state machine is in XLOG_FROM_ARCHIVE state. If
> +        * successful, the state machine moves to XLOG_FROM_STREAM state, otherwise
> +        * it falls back to XLOG_FROM_ARCHIVE state.
>
> It's not clear to me how this is expected to interact with the pg_wal phase
> of standby recovery.  As the docs note [0], standby servers loop through
> archive recovery, recovery from pg_wal, and streaming replication.  Does
> this cause the pg_wal phase to be skipped (i.e., the standby goes straight
> from archive recovery to streaming replication)?  I wonder if it'd be
> better for this mechanism to simply move the standby to the pg_wal phase so
> that the usual ordering isn't changed.
>
> [0] https://www.postgresql.org/docs/current/warm-standby.html#STANDBY-SERVER-OPERATION

It doesn't change any behaviour as such for XLOG_FROM_PG_WAL. In
standby mode when restore_command is specified, the initial value of
currentSource would be XLOG_FROM_ARCHIVE (see [1]). If the archive is
exhausted of WAL or the standby fails to fetch from the archive, then
it switches to XLOG_FROM_STREAM. If the standby fails to receive WAL
from primary, it switches back to XLOG_FROM_ARCHIVE. This continues
unless the standby gets promoted. With the patch, we enable the
standby to try fetching from the primary, instead of waiting for WAL
in the archive to get exhausted or for an error to occur in the
standby while receiving from the archive.

> +                    if (!first_time &&
> +                        TimestampDifferenceExceeds(last_switch_time, curr_time,
> +                                                   wal_source_switch_interval))
>
> Shouldn't this also check that wal_source_switch_interval is not set to 0?

Corrected.

> +                                               elog(DEBUG2,
> +                             "trying to switch WAL source to %s after fetching WAL from %s for %d milliseconds",
> +                                                        xlogSourceNames[XLOG_FROM_STREAM],
> +                                                        xlogSourceNames[currentSource],
> +                                                        wal_source_switch_interval);
> +
> +                                               last_switch_time = curr_time;
>
> Shouldn't the last_switch_time be set when the state machine first enters
> XLOG_FROM_ARCHIVE?  IIUC this logic is currently counting time spent
> elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
> source switch.  This would mean that a standby that has spent a lot of time
> in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
> immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
> XLOG_FROM_ARCHIVE when it failed again.  Given the standby will wait for
> wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
> seems like we could end up rapidly looping between sources.  Perhaps I am
> misunderstanding how this is meant to work.

last_switch_time indicates the time when the standby last attempted to
switch to primary. For instance, a standby:
1) for the first time, sets last_switch_time = current_time when in archive mode
2) if current_time < last_switch_time + interval, continues in archive mode
3) if current_time >= last_switch_time + interval, attempts to switch to
primary and sets last_switch_time = current_time
3.1) if it successfully switches to primary, it continues there; if it
later fails to fetch from primary for any reason, it enters archive mode
and loops from step (2)
3.2) if it fails to switch to primary, it enters archive mode and loops
from step (2)

Hope this clarifies the behaviour.
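
As a rough model of these steps (a toy sketch only: a plain millisecond clock and a boolean stand in for the real TimestampTz machinery and the walreceiver outcome; names are invented):

```c
/* Toy model of steps (1)-(3.2) above: two sources, with last_switch_time
 * gating how often the standby retries streaming while in archive mode. */
#include <stdbool.h>

typedef enum { FROM_ARCHIVE, FROM_STREAM } XLogSourceSim;

static int interval_ms = 5000;          /* the proposed GUC */
static long last_switch_time = 0;       /* step (1): initial value */
static XLogSourceSim source = FROM_ARCHIVE;

/* One pass of the loop; stream_ok simulates whether an attempt to
 * stream from the primary would currently succeed. */
static void
advance(long now, bool stream_ok)
{
    if (source == FROM_STREAM)
    {
        if (!stream_ok)
            source = FROM_ARCHIVE;      /* step (3.1): fall back */
        return;
    }
    if (now - last_switch_time < interval_ms)
        return;                         /* step (2): stay in archive mode */
    last_switch_time = now;             /* step (3): record the attempt */
    source = stream_ok ? FROM_STREAM : FROM_ARCHIVE;  /* (3.1) vs (3.2) */
}
```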

> +       {
> +               {"wal_source_switch_interval", PGC_SIGHUP, REPLICATION_STANDBY,
> +                       gettext_noop("Sets the time to wait before switching WAL source from archive to primary"),
> +                       gettext_noop("0 turns this feature off."),
> +                       GUC_UNIT_MS
> +               },
> +               &wal_source_switch_interval,
> +               5000, 0, INT_MAX,
> +               NULL, NULL, NULL
> +       },
>
> I wonder if the lower bound should be higher to avoid switching
> unnecessarily rapidly between WAL sources.  I see that
> WaitForWALToBecomeAvailable() ensures that standbys do not switch from
> XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
> wal_retrieve_retry_interval.  Perhaps wal_retrieve_retry_interval should be
> the lower bound for this GUC, too.  Or maybe WaitForWALToBecomeAvailable()
> should make sure that the standby makes at least once attempt to restore
> the file from archive before switching to streaming replication.

No, we need a way to disable the feature, so I'm not changing the
lower bound. And let's not make this GUC dependent on any other GUC, I
would like to keep it simple for better usability. However, I've
increased the default value to 5min and added a note in the docs about
the lower values.

I'm attaching the v3 patch with the review comments addressed, please
review it further.

[1]
    if (!InArchiveRecovery)
        currentSource = XLOG_FROM_PG_WAL;
    else if (currentSource == XLOG_FROM_ANY ||
             (!StandbyMode && currentSource == XLOG_FROM_STREAM))
    {
        lastSourceFailed = false;
        currentSource = XLOG_FROM_ARCHIVE;
    }

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:
> On Wed, Sep 7, 2022 at 3:27 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> It's not clear to me how this is expected to interact with the pg_wal phase
>> of standby recovery.  As the docs note [0], standby servers loop through
>> archive recovery, recovery from pg_wal, and streaming replication.  Does
>> this cause the pg_wal phase to be skipped (i.e., the standby goes straight
>> from archive recovery to streaming replication)?  I wonder if it'd be
>> better for this mechanism to simply move the standby to the pg_wal phase so
>> that the usual ordering isn't changed.
> 
> It doesn't change any behaviour as such for XLOG_FROM_PG_WAL. In
> standby mode when restore_command is specified, the initial value of
> currentSource would be XLOG_FROM_ARCHIVE (see [1]). If the archive is
> exhausted of WAL or the standby fails to fetch from the archive, then
> it switches to XLOG_FROM_STREAM. If the standby fails to receive WAL
> from primary, it switches back to XLOG_FROM_ARCHIVE. This continues
> unless the standby gets promoted. With the patch, we enable the
> standby to try fetching from the primary, instead of waiting for WAL
> in the archive to get exhausted or for an error to occur in the
> standby while receiving from the archive.

Okay.  I see that you are checking for XLOG_FROM_ARCHIVE.

>> Shouldn't the last_switch_time be set when the state machine first enters
>> XLOG_FROM_ARCHIVE?  IIUC this logic is currently counting time spent
>> elsewhere (e.g., XLOG_FROM_STREAM) when determining whether to force a
>> source switch.  This would mean that a standby that has spent a lot of time
>> in streaming replication before failing would flip to XLOG_FROM_ARCHIVE,
>> immediately flip back to XLOG_FROM_STREAM, and then likely flip back to
>> XLOG_FROM_ARCHIVE when it failed again.  Given the standby will wait for
>> wal_retrieve_retry_interval before going back to XLOG_FROM_ARCHIVE, it
>> seems like we could end up rapidly looping between sources.  Perhaps I am
>> misunderstanding how this is meant to work.
> 
> last_switch_time indicates the time when the standby last attempted to
> switch to primary. For instance, a standby:
> 1) for the first time, sets last_switch_time = current_time when in archive mode
> 2) if current_time < last_switch_time + interval, continues to be in
> archive mode
> 3) if current_time >= last_switch_time + interval, attempts to switch
> to primary and sets last_switch_time = current_time
> 3.1) if successfully switches to primary, continues in there and for
> any reason fails to fetch from primary, then enters archive mode and
> loops from step (2)
> 3.2) if fails to switch to primary, then enters archive mode and loops
> from step (2)

Let's say I have this new parameter set to 5 minutes, and I have a standby
that's been at step 3.1 for 5 days before failing and going back to step 2.
Won't the standby immediately jump back to step 3.1?  I think we should
place the limit on how long the server stays in XLOG_FROM_ARCHIVE, not how
long it's been since we last tried XLOG_FROM_STREAM.
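
The suggested variant can be modelled by starting the timer when the standby enters archive recovery rather than at the last streaming attempt (again a toy sketch with a plain millisecond clock and invented names, not patch code):

```c
/* Toy model of the suggested fix: the timer starts when the standby
 * *enters* archive recovery, so a long streaming stint beforehand does
 * not make the next switch attempt immediate. */
#include <stdbool.h>

typedef enum { FROM_ARCHIVE, FROM_STREAM } XLogSourceSim;

static int interval_ms = 5000;
static long archive_entry_time = 0;
static XLogSourceSim source = FROM_STREAM;  /* streaming for a long time */

static void
advance(long now, bool stream_ok)
{
    if (source == FROM_STREAM)
    {
        if (!stream_ok)
        {
            source = FROM_ARCHIVE;
            archive_entry_time = now;   /* timer starts here, not earlier */
        }
        return;
    }
    /* stay in archive recovery for at least interval_ms */
    if (now - archive_entry_time >= interval_ms && stream_ok)
        source = FROM_STREAM;
}
```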

>> I wonder if the lower bound should be higher to avoid switching
>> unnecessarily rapidly between WAL sources.  I see that
>> WaitForWALToBecomeAvailable() ensures that standbys do not switch from
>> XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE more often than once per
>> wal_retrieve_retry_interval.  Perhaps wal_retrieve_retry_interval should be
>> the lower bound for this GUC, too.  Or maybe WaitForWALToBecomeAvailable()
>> should make sure that the standby makes at least once attempt to restore
>> the file from archive before switching to streaming replication.
> 
> No, we need a way to disable the feature, so I'm not changing the
> lower bound. And let's not make this GUC dependent on any other GUC, I
> would like to keep it simple for better usability. However, I've
> increased the default value to 5min and added a note in the docs about
> the lower values.
> 
> I'm attaching the v3 patch with the review comments addressed, please
> review it further.

My general point is that we should probably offer some basic preventative
measure against flipping back and forth between streaming and archive
recovery while making zero progress.  As I noted, maybe that's as simple as
having WaitForWALToBecomeAvailable() attempt to restore a file from archive
at least once before the new parameter forces us to switch to streaming
replication.  There might be other ways to handle this.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Kyotaro Horiguchi
Date:
Being late for the party.

It seems to me that the function is getting too long.  I think we
might want to move the core part of the patch into another function.

I think it might be better if intentionalSourceSwitch doesn't need
lastSourceFailed set. It would look like this:

>  if (lastSourceFailed || switchSource)
>  {
>     if (nonblocking && lastSourceFailed)
>        return XLREAD_WOULDBLOCK;


+                    if (first_time)
+                        last_switch_time = curr_time;
..
+                    if (!first_time &&
+                        TimestampDifferenceExceeds(last_switch_time, curr_time,
..
+                    /* We're not here for the first time any more */
+                    if (first_time)
+                        first_time = false;

I don't think the flag first_time is needed.
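
One way to read this (a hypothetical sketch with a plain long standing in for TimestampTz; not the patch's actual code): seed the switch timestamp once at initialization, after which the elapsed-interval check alone decides, and no first_time flag is needed.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the simplification suggested above, with a
 * plain long standing in for TimestampTz.  Seeding last_switch_time
 * once at initialization makes a separate first_time flag unnecessary:
 * the elapsed-interval test alone decides when to retry streaming.
 */
static long last_switch_time;

static void
init_source_switch_timer(long now)
{
	last_switch_time = now;		/* replaces the "if (first_time)" branch */
}

static int
retry_interval_elapsed(long now, long interval_ms)
{
	if (now - last_switch_time >= interval_ms)
	{
		last_switch_time = now;	/* restart the interval on each switch */
		return 1;
	}
	return 0;
}
```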


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Switching XLog source from archive to streaming when primary available

From
Kyotaro Horiguchi
Date:
At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:
> > I'm attaching the v3 patch with the review comments addressed, please
> > review it further.
> 
> My general point is that we should probably offer some basic preventative
> measure against flipping back and forth between streaming and archive
> recovery while making zero progress.  As I noted, maybe that's as simple as
> having WaitForWALToBecomeAvailable() attempt to restore a file from archive
> at least once before the new parameter forces us to switch to streaming
> replication.  There might be other ways to handle this.

+1.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
> > On Thu, Sep 08, 2022 at 05:16:53PM +0530, Bharath Rupireddy wrote:
> > > I'm attaching the v3 patch with the review comments addressed, please
> > > review it further.
> >
> > My general point is that we should probably offer some basic preventative
> > measure against flipping back and forth between streaming and archive
> > recovery while making zero progress.  As I noted, maybe that's as simple as
> > having WaitForWALToBecomeAvailable() attempt to restore a file from archive
> > at least once before the new parameter forces us to switch to streaming
> > replication.  There might be other ways to handle this.
>
> +1.

Hm. In that case, I think we can get rid of the timeout-based switching
mechanism and have this behaviour: the standby can attempt to switch
from archive to streaming mode, say, after fetching 1, 2, or a
configurable number of WAL files. In fact, this is the original idea
proposed by Satya in this thread.

If okay, I can code on that. Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote:
> On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>> At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
>> > My general point is that we should probably offer some basic preventative
>> > measure against flipping back and forth between streaming and archive
>> > recovery while making zero progress.  As I noted, maybe that's as simple as
>> > having WaitForWALToBecomeAvailable() attempt to restore a file from archive
>> > at least once before the new parameter forces us to switch to streaming
>> > replication.  There might be other ways to handle this.
>>
>> +1.
> 
> Hm. In that case, I think we can get rid of timeout based switching
> mechanism and have this behaviour - the standby can attempt to switch
> to streaming mode from archive, say, after fetching 1, 2 or a
> configurable number of WAL files. In fact, this is the original idea
> proposed by Satya in this thread.

IMO the timeout approach would be more intuitive for users.  When it comes
to archive recovery, "WAL segment" isn't a standard unit of measure.  WAL
segment size can differ between clusters, and WAL files can have different
amounts of data or take different amounts of time to replay.  So I think it
would be difficult for the end user to decide on a value.  However, even
the timeout approach has this sort of problem.  If your parameter is set to
1 minute, but the current archive takes 5 minutes to recover, you won't
really be testing streaming replication once a minute.  That would likely
need to be documented.
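
To put numbers on the unit problem (hypothetical figures, nothing from the patch): the same segment count corresponds to very different WAL volumes depending on the wal_segment_size a cluster was initialized with.

```c
#include <assert.h>

/*
 * Illustration (hypothetical numbers) of why "N WAL segments" is a poor
 * unit across clusters: the same count maps to very different byte
 * volumes depending on the cluster's wal_segment_size.
 */
static long long
segments_to_bytes(int nsegments, long long wal_segment_size)
{
	return (long long) nsegments * wal_segment_size;
}
```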

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote:
> > On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >> At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
> >> > My general point is that we should probably offer some basic preventative
> >> > measure against flipping back and forth between streaming and archive
> >> > recovery while making zero progress.  As I noted, maybe that's as simple as
> >> > having WaitForWALToBecomeAvailable() attempt to restore a file from archive
> >> > at least once before the new parameter forces us to switch to streaming
> >> > replication.  There might be other ways to handle this.
> >>
> >> +1.
> >
> > Hm. In that case, I think we can get rid of timeout based switching
> > mechanism and have this behaviour - the standby can attempt to switch
> > to streaming mode from archive, say, after fetching 1, 2 or a
> > configurable number of WAL files. In fact, this is the original idea
> > proposed by Satya in this thread.
>
> IMO the timeout approach would be more intuitive for users.  When it comes
> to archive recovery, "WAL segment" isn't a standard unit of measure.  WAL
> segment size can differ between clusters, and WAL files can have different
> amounts of data or take different amounts of time to replay.

How about the amount of WAL bytes fetched from the archive after which
a standby attempts to connect to the primary or enter streaming mode?
Of late, we've changed some GUCs to represent bytes instead of WAL
files/segments; see [1].

> So I think it
> would be difficult for the end user to decide on a value.  However, even
> the timeout approach has this sort of problem.  If your parameter is set to
> 1 minute, but the current archive takes 5 minutes to recover, you won't
> really be testing streaming replication once a minute.  That would likely
> need to be documented.

If we have a configurable number of WAL bytes, instead of a timeout,
for the standby's WAL source switch from archive to primary, we don't
have the above problem, right?

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c3fe108c025e4a080315562d4c15ecbe3f00405e

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
On Fri, Sep 09, 2022 at 11:07:00PM +0530, Bharath Rupireddy wrote:
> On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> IMO the timeout approach would be more intuitive for users.  When it comes
>> to archive recovery, "WAL segment" isn't a standard unit of measure.  WAL
>> segment size can differ between clusters, and WAL files can have different
>> amounts of data or take different amounts of time to replay.
> 
> How about the amount of WAL bytes fetched from the archive after which
> a standby attempts to connect to primary or enter streaming mode? Of
> late, we've changed some GUCs to represent bytes instead of WAL
> files/segments, see [1].

Well, for wal_keep_size, using bytes makes sense.  Given you know how much
disk space you have, you can set this parameter accordingly to avoid
retaining too much of it for standby servers.  For your proposed parameter,
it's not so simple.  The same setting could have wildly different timing
behavior depending on the server.  I still think that a timeout is the most
intuitive.

>> So I think it
>> would be difficult for the end user to decide on a value.  However, even
>> the timeout approach has this sort of problem.  If your parameter is set to
>> 1 minute, but the current archive takes 5 minutes to recover, you won't
>> really be testing streaming replication once a minute.  That would likely
>> need to be documented.
> 
> If we have configurable WAL bytes instead of timeout for standby WAL
> source switch from archive to primary, we don't have the above problem
> right?

If you are going to stop replaying in the middle of a WAL archive, then
maybe.  But I don't think I'd recommend that.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Sat, Sep 10, 2022 at 3:35 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Well, for wal_keep_size, using bytes makes sense.  Given you know how much
> disk space you have, you can set this parameter accordingly to avoid
> retaining too much of it for standby servers.  For your proposed parameter,
> it's not so simple.  The same setting could have wildly different timing
> behavior depending on the server.  I still think that a timeout is the most
> intuitive.

Hm. In the v3 patch, I've used the timeout approach, but tracking the
duration the server spends in XLOG_FROM_ARCHIVE as opposed to tracking
the last failure time of streaming from the primary.

> So I think it
> would be difficult for the end user to decide on a value.  However, even
> the timeout approach has this sort of problem.  If your parameter is set to
> 1 minute, but the current archive takes 5 minutes to recover, you won't
> really be testing streaming replication once a minute.  That would likely
> need to be documented.

Added a note in the docs.

On Fri, Sep 9, 2022 at 10:46 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Being late for the party.

Thanks for reviewing this.

> It seems to me that the function is getting too long.  I think we
> might want to move the core part of the patch into another function.

Yeah, WaitForWALToBecomeAvailable() (without this patch) has around
460 LOC, of which fetching WAL from the chosen source is about 240
LOC; IMO, that code is a candidate for a new function. I think that
part can be discussed separately.

Having said that, I moved the new code to a new function.

> I think it might be better if intentionalSourceSwitch doesn't need
> lastSourceFailed set. It would look like this:
>
> >  if (lastSourceFailed || switchSource)
> >  {
> >     if (nonblocking && lastSourceFailed)
> >        return XLREAD_WOULDBLOCK;

I think the above looks good, done that way in the latest patch.

> I don't think the flag first_time is needed.

Addressed this in the v4 patch.

Please review the attached v4 patch addressing above review comments.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Mon, Sep 12, 2022 at 9:03 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Please review the attached v4 patch addressing above review comments.

Oops, there's a compiler warning [1] with the v4 patch, fixed it.
Please review the attached v5 patch.

[1] https://cirrus-ci.com/task/5730076611313664?logs=gcc_warning#L450

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Mon, Sep 12, 2022 at 11:56 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Please review the attached v5 patch.

I'm attaching the v6 patch that's rebased on to the latest HEAD.
Please consider this for review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Kyotaro Horiguchi
Date:
At Thu, 15 Sep 2022 10:28:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> I'm attaching the v6 patch that's rebased on to the latest HEAD.
> Please consider this for review.

Thanks for the new version!

+#define StreamingReplRetryEnabled() \
+    (streaming_replication_retry_interval > 0 && \
+     StandbyMode && \
+     currentSource == XLOG_FROM_ARCHIVE)

It seems to me a bit too complex..

+            /* Save the timestamp at which we're switching to archive. */
+            if (StreamingReplRetryEnabled())
+                switched_to_archive_at = GetCurrentTimestamp();

Anyway we are going to open a file just after this so
GetCurrentTimestamp() cannot cause a perceptible degradation.
Couldn't we do that unconditionally, to get rid of the macro?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Thu, Sep 15, 2022 at 1:52 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 15 Sep 2022 10:28:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> > I'm attaching the v6 patch that's rebased on to the latest HEAD.
> > Please consider this for review.
>
> Thanks for the new version!
>
> +#define StreamingReplRetryEnabled() \
> +       (streaming_replication_retry_interval > 0 && \
> +        StandbyMode && \
> +        currentSource == XLOG_FROM_ARCHIVE)
>
> It seems to me a bit too complex..

I don't think so; it just tells whether a standby is allowed to
switch its source from archive to streaming.

> +                       /* Save the timestamp at which we're switching to archive. */
> +                       if (StreamingReplRetryEnabled())
> +                               switched_to_archive_at = GetCurrentTimestamp();
>
> Anyway we are going to open a file just after this so
> GetCurrentTimestamp() cannot cause a perceptible degradation.
> Couldn't we do that unconditionally, to get rid of the macro?

Do we really need to do it unconditionally? I don't think so. And, we
can't get rid of the macro, as we need to check for the current
source, GUC and standby mode. When this feature is disabled, it
mustn't execute any extra code IMO.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Kyotaro Horiguchi
Date:
At Fri, 16 Sep 2022 09:15:58 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Thu, Sep 15, 2022 at 1:52 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Thu, 15 Sep 2022 10:28:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> > > I'm attaching the v6 patch that's rebased on to the latest HEAD.
> > > Please consider this for review.
> >
> > Thaks for the new version!
> >
> > +#define StreamingReplRetryEnabled() \
> > +       (streaming_replication_retry_interval > 0 && \
> > +        StandbyMode && \
> > +        currentSource == XLOG_FROM_ARCHIVE)
> >
> > It seems to me a bit too complex..

In other words, it seems to me that the macro name doesn't manifest
the condition correctly.

> I don't think so, it just tells whether a standby is allowed to switch
> source to stream from archive.
> 
> > +                       /* Save the timestamp at which we're switching to archive. */
> > +                       if (StreamingReplRetryEnabled())
> > +                               switched_to_archive_at = GetCurrentTimestamp();
> >
> > Anyway we are going to open a file just after this so
> > GetCurrentTimestamp() cannot cause a perceptible degradation.
> > Couldn't we do that unconditionally, to get rid of the macro?
> 
> Do we really need to do it unconditionally? I don't think so. And, we
> can't get rid of the macro, as we need to check for the current
> source, GUC and standby mode. When this feature is disabled, it
> mustn't execute any extra code IMO.

I don't particularly want to do that unconditionally. I just wanted
to get rid of the macro at the usage site. Even if the same condition
is used elsewhere, I see it as better to write out the condition
directly there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Fri, Sep 16, 2022 at 12:06 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> In other words, it seems to me that the macro name doesn't manifest
> the condition correctly.
>
> I don't particularly want to do that unconditionally. I just wanted
> to get rid of the macro at the usage site. Even if the same condition
> is used elsewhere, I see it as better to write out the condition
> directly there.

I wanted to avoid a bit of duplicate code there. How about naming that
macro IsXLOGSourceSwitchToStreamEnabled() or
SwitchFromArchiveToStreamEnabled() or just SwitchFromArchiveToStream()
or any other better name?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Fri, Sep 16, 2022 at 4:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Sep 16, 2022 at 12:06 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > In other words, it seems to me that the macro name doesn't manifest
> > the condition correctly.
> >
> > I don't particularly want to do that unconditionally. I just wanted
> > to get rid of the macro at the usage site. Even if the same condition
> > is used elsewhere, I see it as better to write out the condition
> > directly there.
>
> I wanted to avoid a bit of duplicate code there. How about naming that
> macro IsXLOGSourceSwitchToStreamEnabled() or
> SwitchFromArchiveToStreamEnabled() or just SwitchFromArchiveToStream()
> or any other better name?

SwitchFromArchiveToStreamEnabled() seemed better at this point. I'm
attaching the v7 patch with that change. Please review it further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
On Mon, Sep 19, 2022 at 07:49:21PM +0530, Bharath Rupireddy wrote:
> SwitchFromArchiveToStreamEnabled() seemed better at this point. I'm
> attaching the v7 patch with that change. Please review it further.

As I mentioned upthread [0], I'm still a little concerned that this patch
will cause the state machine to go straight from archive recovery to
streaming replication, skipping recovery from pg_wal.  I wonder if this
could be resolved by moving the standby to the pg_wal phase instead.
Concretely, this line

+                if (switchSource)
+                    break;

would instead change currentSource from XLOG_FROM_ARCHIVE to
XLOG_FROM_PG_WAL before the call to XLogFileReadAnyTLI().  I suspect the
behavior would be basically the same, but it would maintain the existing
ordering.

However, I do see the following note elsewhere in xlogrecovery.c:

 * The segment can be fetched via restore_command, or via walreceiver having
 * streamed the record, or it can already be present in pg_wal. Checking
 * pg_wal is mainly for crash recovery, but it will be polled in standby mode
 * too, in case someone copies a new segment directly to pg_wal. That is not
 * documented or recommended, though.

Given this information, the present behavior might not be too important,
but I don't see a point in changing it without good reason.

[0] https://postgr.es/m/20220906215704.GA2084086%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Sun, Oct 9, 2022 at 3:22 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> As I mentioned upthread [0], I'm still a little concerned that this patch
> will cause the state machine to go straight from archive recovery to
> streaming replication, skipping recovery from pg_wal.
>
> [0] https://postgr.es/m/20220906215704.GA2084086%40nathanxps13

Yes, with the patch it goes straight to streaming replication,
skipping recovery from pg_wal.

> I wonder if this
> could be resolved by moving the standby to the pg_wal phase instead.
> Concretely, this line
>
> +                               if (switchSource)
> +                                       break;
>
> would instead change currentSource from XLOG_FROM_ARCHIVE to
> XLOG_FROM_PG_WAL before the call to XLogFileReadAnyTLI().  I suspect the
> behavior would be basically the same, but it would maintain the existing
> ordering.

We can give it a chance to restore from pg_wal before switching to
streaming so as not to change any behaviour of the state machine, but
definitely not by setting currentSource to XLOG_FROM_PG_WAL; we
basically never explicitly set currentSource to XLOG_FROM_PG_WAL,
other than when we're not in archive recovery, i.e., when
InArchiveRecovery is false. Also, see the comment [1].

Instead, the simplest approach would be to just pass XLOG_FROM_PG_WAL
to XLogFileReadAnyTLI() when we're about to switch the source to
streaming. This doesn't change the existing behaviour.
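
A rough model of that ordering (the enum names echo the real XLogSource values, but this is only an illustration, not patch code): when a timed switch to streaming is pending, pg_wal is consulted once more before streaming, preserving the existing archive -> pg_wal -> stream order.

```c
#include <assert.h>

/*
 * Toy model of the ordering discussed here (names echo the real
 * XLogSource values; this is an illustration, not patch code).  When a
 * timed switch to streaming is pending, pg_wal is consulted once more
 * before streaming, preserving the archive -> pg_wal -> stream order.
 */
typedef enum
{
	FROM_ARCHIVE,
	FROM_PG_WAL,
	FROM_STREAM
} WalSource;

static WalSource
next_source_on_pending_switch(WalSource cur)
{
	if (cur == FROM_ARCHIVE)
		return FROM_PG_WAL;		/* try existing WAL in pg_wal first */
	return FROM_STREAM;			/* then fall through to streaming */
}
```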

> However, I do see the following note elsewhere in xlogrecovery.c:
>
>  * The segment can be fetched via restore_command, or via walreceiver having
>  * streamed the record, or it can already be present in pg_wal. Checking
>  * pg_wal is mainly for crash recovery, but it will be polled in standby mode
>  * too, in case someone copies a new segment directly to pg_wal. That is not
>  * documented or recommended, though.
>
> Given this information, the present behavior might not be too important,
> but I don't see a point in changing it without good reason.

Yeah, with the attached patch we don't skip pg_wal before switching to
streaming mode.

I've also added a note in the 'Standby Server Operation' section about
the new feature.

Please review the v8 patch further.

Unrelated to this patch: the claim that the standby's polling of
pg_wal is not documented or recommended is not true; it is actually
documented [2]. Whether or not we change the docs to something like
[3] is a separate discussion.

[1]
            /*
             * We just successfully read a file in pg_wal. We prefer files in
             * the archive over ones in pg_wal, so try the next file again
             * from the archive first.
             */

[2] https://www.postgresql.org/docs/current/warm-standby.html#STANDBY-SERVER-OPERATION
The standby server will also attempt to restore any WAL found in the
standby cluster's pg_wal directory. That typically happens after a
server restart, when the standby replays again WAL that was streamed
from the primary before the restart, but you can also manually copy
files to pg_wal at any time to have them replayed.

[3]
The standby server will also attempt to restore any WAL found in the
standby cluster's pg_wal directory. That typically happens after a
server restart, when the standby replays again WAL that was streamed
from the primary before the restart, but you can also manually copy
files to pg_wal at any time to have them replayed. However, copying of
WAL files manually is not recommended.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
On Sun, Oct 09, 2022 at 02:39:47PM +0530, Bharath Rupireddy wrote:
> We can give it a chance to restore from pg_wal before switching to
> streaming to not change any behaviour of the state machine. But, not
> definitely by setting currentSource to XLOG_FROM_WAL, we basically
> never explicitly set currentSource to XLOG_FROM_WAL, other than when
> not in archive recovery i.e. InArchiveRecovery is false. Also, see the
> comment [1].
> 
> Instead, the simplest would be to just pass XLOG_FROM_WAL to
> XLogFileReadAnyTLI() when we're about to switch the source to stream
> mode. This doesn't change the existing behaviour.

It might be more consistent with existing behavior, but one thing I hadn't
considered is that it might make your proposed feature ineffective when
users are copying files straight into pg_wal.  IIUC as long as the files
are present in pg_wal, the source-switch logic won't kick in.

> Unrelated to this patch, the fact that the standby polls pg_wal is not
> documented or recommended, is not true, it is actually documented [2].
> Whether or not we change the docs to be something like [3], is a
> separate discussion.

I wonder if it would be better to simply remove this extra polling of
pg_wal as a prerequisite to your patch.  The existing commentary leads me
to think there might not be a strong reason for this behavior, so it could
be a nice way to simplify your patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> > Instead, the simplest would be to just pass XLOG_FROM_WAL to
> > XLogFileReadAnyTLI() when we're about to switch the source to stream
> > mode. This doesn't change the existing behaviour.
>
> It might be more consistent with existing behavior, but one thing I hadn't
> considered is that it might make your proposed feature ineffective when
> users are copying files straight into pg_wal.  IIUC as long as the files
> are present in pg_wal, the source-switch logic won't kick in.

That happens even now; that is, the server will not switch from
archive to streaming mode after a failure if someone is continuously
copying WAL files into the pg_wal directory. I have not personally
seen anyone or any service doing that, but that doesn't mean it can't
happen. They might do it for some purpose, such as 1) to quickly bring
back in sync a standby that's lagging behind the primary after the
archive and/or streaming replication connections are broken but many
WAL files are left over on the primary, or 2) before promoting a
lagging standby for failover or other purposes. However, I'm not sure
whether anyone does these things on production servers.

> > Unrelated to this patch, the fact that the standby polls pg_wal is not
> > documented or recommended, is not true, it is actually documented [2].
> > Whether or not we change the docs to be something like [3], is a
> > separate discussion.
>
> I wonder if it would be better to simply remove this extra polling of
> pg_wal as a prerequisite to your patch.  The existing commentary leads me
> to think there might not be a strong reason for this behavior, so it could
> be a nice way to simplify your patch.

I don't think it's a good idea to remove that completely. As said
above, it might help someone, we never know.

I think for this feature, we just need to decide on whether or not
we'd allow pg_wal polling before switching to streaming mode. If we
allow it like in the v8 patch, we can document the behavior.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
On Mon, Oct 10, 2022 at 11:33:57AM +0530, Bharath Rupireddy wrote:
> On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I wonder if it would be better to simply remove this extra polling of
>> pg_wal as a prerequisite to your patch.  The existing commentary leads me
>> to think there might not be a strong reason for this behavior, so it could
>> be a nice way to simplify your patch.
> 
> I don't think it's a good idea to remove that completely. As said
> above, it might help someone, we never know.

It would be great to hear whether anyone is using this functionality.  If
no one is aware of existing usage and there is no interest in keeping it
around, I don't think it would be unreasonable to remove it in v16.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Tue, Oct 11, 2022 at 8:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Oct 10, 2022 at 11:33:57AM +0530, Bharath Rupireddy wrote:
> > On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> I wonder if it would be better to simply remove this extra polling of
> >> pg_wal as a prerequisite to your patch.  The existing commentary leads me
> >> to think there might not be a strong reason for this behavior, so it could
> >> be a nice way to simplify your patch.
> >
> > I don't think it's a good idea to remove that completely. As said
> > above, it might help someone, we never know.
>
> It would be great to hear whether anyone is using this functionality.  If
> no one is aware of existing usage and there is no interest in keeping it
> around, I don't think it would be unreasonable to remove it in v16.

It seems that exhausting all the WAL in pg_wal before switching to
streaming, after failing to fetch from the archive, cannot be removed.
I found this after experimenting with it; here are my findings:
1. The standby has to recover the initial WAL files in the pg_wal
directory even for the normal post-restart/first-time-start case, I
mean, in the non-crash recovery case.
2. If the standby receives WAL files from the primary quickly (the
walreceiver just writes and flushes the received WAL to files under
pg_wal) and/or standby recovery is slow, and, say, both the standby's
connection to the primary and the archive connection break for
whatever reason, then it has WAL files to recover in the pg_wal
directory.

I think the fundamental behaviour for the standby is that it has to
fully recover to the end of WAL under pg_wal, no matter who copies WAL
files there. I fully understand the consequences of manually copying
WAL files into pg_wal; for that matter, manually copying or tinkering
with any other files in or under the data directory is something we
don't recommend or encourage.

In summary, the standby state machine in WaitForWALToBecomeAvailable()
exhausts all the WAL in pg_wal before switching to streaming after
failing to fetch from the archive. The v8 patch proposed upthread
deviates from this behaviour. Hence, I'm attaching a v9 patch that
keeps the behaviour as-is; that is, the standby exhausts all the WAL
in pg_wal before switching to streaming, after fetching WAL from the
archive for at least streaming_replication_retry_interval
milliseconds.
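
The v9 rule can be modeled roughly like this (a simplified sketch, not the patch itself): even after the retry interval elapses, the standby drains whatever WAL is already in pg_wal before the source may switch to streaming.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model (not patch code) of the v9 behaviour: even after the
 * retry interval elapses, the standby first replays everything already
 * present in pg_wal; only then may the source switch to streaming.
 */
static int
replay_before_streaming(int segments_in_pg_wal, bool interval_elapsed,
						bool *switched_to_streaming)
{
	int			replayed = 0;

	*switched_to_streaming = false;

	if (!interval_elapsed)
		return 0;				/* keep fetching from the archive */

	while (segments_in_pg_wal-- > 0)
		replayed++;				/* exhaust pg_wal first */

	*switched_to_streaming = true;
	return replayed;
}
```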

Please review the v9 patch further.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Ian Lawrence Barwick
Date:
On Tue, Oct 18, 2022 at 11:02 Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Oct 11, 2022 at 8:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >
> > On Mon, Oct 10, 2022 at 11:33:57AM +0530, Bharath Rupireddy wrote:
> > > On Mon, Oct 10, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> > >> I wonder if it would be better to simply remove this extra polling of
> > >> pg_wal as a prerequisite to your patch.  The existing commentary leads me
> > >> to think there might not be a strong reason for this behavior, so it could
> > >> be a nice way to simplify your patch.
> > >
> > > I don't think it's a good idea to remove that completely. As said
> > > above, it might help someone, we never know.
> >
> > It would be great to hear whether anyone is using this functionality.  If
> > no one is aware of existing usage and there is no interest in keeping it
> > around, I don't think it would be unreasonable to remove it in v16.
>
> It seems like exhausting all the WAL in pg_wal before switching to
> streaming after failing to fetch from archive is unremovable. I found
> this after experimenting with it, here are my findings:
> 1. The standby has to recover initial WAL files in the pg_wal
> directory even for the normal post-restart/first-time-start case, I
> mean, in non-crash recovery case.
> 2. The standby received WAL files from primary (walreceiver just
> writes and flushes the received WAL to WAL files under pg_wal)
> pretty-fast and/or standby recovery is slow, say both the standby
> connection to primary and archive connection are broken for whatever
> reasons, then it has WAL files to recover in pg_wal directory.
>
> I think the fundamental behaviour for the standy is that it has to
> fully recover to the end of WAL under pg_wal no matter who copies WAL
> files there. I fully understand the consequences of manually copying
> WAL files into pg_wal, for that matter, manually copying/tinkering any
> other files into/under the data directory is something we don't
> recommend and encourage.
>
> In summary, the standby state machine in WaitForWALToBecomeAvailable()
> exhausts all the WAL in pg_wal before switching to streaming after
> failing to fetch from archive. The v8 patch proposed upthread deviates
> from this behaviour. Hence, attaching v9 patch that keeps the
> behaviour as-is, that means, the standby exhausts all the WAL in
> pg_wal before switching to streaming after fetching WAL from archive
> for at least streaming_replication_retry_interval milliseconds.
>
> Please review the v9 patch further.

Thanks for the updated patch.

While reviewing the patch backlog, we have determined that this patch adds
one or more TAP tests but has not added the test to the "meson.build" file.

To do this, locate the relevant "meson.build" file for each test and add it
in the 'tests' dictionary, which will look something like this:

  'tap': {
    'tests': [
      't/001_basic.pl',
    ],
  },

For some additional details please see this Wiki article:

  https://wiki.postgresql.org/wiki/Meson_for_patch_authors

For more information on the meson build system for PostgreSQL see:

  https://wiki.postgresql.org/wiki/Meson


Regards

Ian Barwick



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Wed, Nov 16, 2022 at 9:38 AM Ian Lawrence Barwick <barwick@gmail.com> wrote:
>
> While reviewing the patch backlog, we have determined that this patch adds
> one or more TAP tests but has not added the test to the "meson.build" file.

Thanks for pointing it out. Yeah, the test wasn't being picked up by
meson builds. I added the new test file name in
src/test/recovery/meson.build.

I'm attaching the v10 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
On Tue, Oct 18, 2022 at 07:31:33AM +0530, Bharath Rupireddy wrote:
> In summary, the standby state machine in WaitForWALToBecomeAvailable()
> exhausts all the WAL in pg_wal before switching to streaming after
> failing to fetch from archive. The v8 patch proposed upthread deviates
> from this behaviour. Hence, attaching v9 patch that keeps the
> behaviour as-is, that means, the standby exhausts all the WAL in
> pg_wal before switching to streaming after fetching WAL from archive
> for at least streaming_replication_retry_interval milliseconds.

I think this is okay.  The following comment explains why archives are
preferred over existing files in pg_wal:

     * When doing archive recovery, we always prefer an archived log file even
     * if a file of the same name exists in XLOGDIR.  The reason is that the
     * file in XLOGDIR could be an old, un-filled or partly-filled version
     * that was copied and restored as part of backing up $PGDATA.

With your patch, we might replay one of these "old" files in pg_wal instead
of the complete version of the file from the archives, but I think that is
still correct.  We'll just replay whatever exists in pg_wal (which may be
un-filled or partly-filled) before attempting streaming.  If that fails,
we'll go back to trying the archives again.

Would you mind testing this scenario?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Oct 18, 2022 at 07:31:33AM +0530, Bharath Rupireddy wrote:
> > In summary, the standby state machine in WaitForWALToBecomeAvailable()
> > exhausts all the WAL in pg_wal before switching to streaming after
> > failing to fetch from archive. The v8 patch proposed upthread deviates
> > from this behaviour. Hence, attaching v9 patch that keeps the
> > behaviour as-is, that means, the standby exhausts all the WAL in
> > pg_wal before switching to streaming after fetching WAL from archive
> > for at least streaming_replication_retry_interval milliseconds.
>
> I think this is okay.  The following comment explains why archives are
> preferred over existing files in pg_wal:
>
>          * When doing archive recovery, we always prefer an archived log file even
>          * if a file of the same name exists in XLOGDIR.  The reason is that the
>          * file in XLOGDIR could be an old, un-filled or partly-filled version
>          * that was copied and restored as part of backing up $PGDATA.
>
> With your patch, we might replay one of these "old" files in pg_wal instead
> of the complete version of the file from the archives,

That's true even today, without the patch, no? We're not changing the
existing behaviour of the state machine. Can you explain how it
happens with the patch?

On HEAD, after failing to read from the archive, the standby exhausts
all WAL from pg_wal and then switches to streaming mode. With the
patch, after reading from the archive for at least
streaming_replication_retry_interval milliseconds, it exhausts all WAL
from pg_wal and then switches to streaming mode.

> but I think that is
> still correct.  We'll just replay whatever exists in pg_wal (which may be
> un-filled or partly-filled) before attempting streaming.  If that fails,
> we'll go back to trying the archives again.
>
> Would you mind testing this scenario?

How about something like below for testing the above scenario? If it
looks okay, I can add it as a new TAP test file.

1. Generate WAL files f1 and f2 and archive them.
2. Check the replay lsn and WAL file name on the standby, when it
replays upto f2, stop the standby.
3. Set recovery to fail on the standby, and stop the standby.
4. Generate f3, f4 (partially filled) on the primary.
5. Manually copy f3, f4 to the standby's pg_wal.
6. Start the standby, since recovery is set to fail, and there're new
WAL files (f3, f4) under its pg_wal, it must replay those WAL files
(check the replay lsn and WAL file name, it must be f4) before
switching to streaming.
7. Generate f5 on the primary.
8. The standby should receive f5 and replay it (check the replay lsn
and WAL file name, it must be f5).
9. Set streaming to fail on the standby and set recovery to succeed.
10. Generate f6 on the primary.
11. The standby should receive f6 via archive and replay it (check the
replay lsn and WAL file name, it must be f6).

If needed, we can look out for these messages to confirm it works as expected:
    elog(DEBUG2, "switched WAL source from %s to %s after %s",
         xlogSourceNames[oldSource], xlogSourceNames[currentSource],
         lastSourceFailed ? "failure" : "success");

    ereport(LOG,
            (errmsg("restored log file \"%s\" from archive",
                    xlogfname)));

Essentially, it covers what the documentation
https://www.postgresql.org/docs/devel/warm-standby.html says:

"In standby mode, the server continuously applies WAL received from
the primary server. The standby server can read WAL from a WAL archive
(see restore_command) or directly from the primary over a TCP
connection (streaming replication). The standby server will also
attempt to restore any WAL found in the standby cluster's pg_wal
directory. That typically happens after a server restart, when the
standby replays again WAL that was streamed from the primary before
the restart, but you can also manually copy files to pg_wal at any
time to have them replayed."

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
On Tue, Jan 17, 2023 at 07:44:52PM +0530, Bharath Rupireddy wrote:
> On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> With your patch, we might replay one of these "old" files in pg_wal instead
>> of the complete version of the file from the archives,
> 
> That's true even today, without the patch, no? We're not changing the
> existing behaviour of the state machine. Can you explain how it
> happens with the patch?

My point is that on HEAD, we will always prefer a complete archive file.
With your patch, we might instead choose to replay an old file in pg_wal
because we are artificially advancing the state machine.  IOW even if
there's a complete archive available, we might not use it.  This is a
behavior change, but I think it is okay.

>> Would you mind testing this scenario?
> 
> How about something like below for testing the above scenario? If it
> looks okay, I can add it as a new TAP test file.
> 
> 1. Generate WAL files f1 and f2 and archive them.
> 2. Check the replay lsn and WAL file name on the standby, when it
> replays upto f2, stop the standby.
> 3. Set recovery to fail on the standby, and stop the standby.
> 4. Generate f3, f4 (partially filled) on the primary.
> 5. Manually copy f3, f4 to the standby's pg_wal.
> 6. Start the standby, since recovery is set to fail, and there're new
> WAL files (f3, f4) under its pg_wal, it must replay those WAL files
> (check the replay lsn and WAL file name, it must be f4) before
> switching to streaming.
> 7. Generate f5 on the primary.
> 8. The standby should receive f5 and replay it (check the replay lsn
> and WAL file name, it must be f5).
> 9. Set streaming to fail on the standby and set recovery to succeed.
> 10. Generate f6 on the primary.
> 11. The standby should receive f6 via archive and replay it (check the
> replay lsn and WAL file name, it must be f6).

I meant testing the scenario where there's an old file in pg_wal, a
complete file in the archives, and your new GUC forces replay of the
former.  This might be difficult to do in a TAP test.  Ultimately, I just
want to validate the assumptions discussed above.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Thu, Jan 19, 2023 at 6:20 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Jan 17, 2023 at 07:44:52PM +0530, Bharath Rupireddy wrote:
> > On Thu, Jan 12, 2023 at 6:21 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> With your patch, we might replay one of these "old" files in pg_wal instead
> >> of the complete version of the file from the archives,
> >
> > That's true even today, without the patch, no? We're not changing the
> > existing behaviour of the state machine. Can you explain how it
> > happens with the patch?
>
> My point is that on HEAD, we will always prefer a complete archive file.
> With your patch, we might instead choose to replay an old file in pg_wal
> because we are artificially advancing the state machine.  IOW even if
> there's a complete archive available, we might not use it.  This is a
> behavior change, but I think it is okay.

Oh, yeah, I too agree that it's okay, because manually copying WAL
files directly to pg_wal (which eventually get replayed before
switching to streaming) isn't recommended anyway for production-level
servers. I think we covered it in the documentation that the standby
exhausts all the WAL present in pg_wal before switching. Isn't that enough?

+        Specifies amount of time after which standby attempts to switch WAL
+        source from WAL archive to streaming replication (get WAL from
+        primary). However, exhaust all the WAL present in pg_wal before
+        switching. If the standby fails to switch to stream mode, it falls
+        back to archive mode.

> >> Would you mind testing this scenario?
> > 11. The standby should receive f6 via archive and replay it (check the
> > replay lsn and WAL file name, it must be f6).
>
> I meant testing the scenario where there's an old file in pg_wal, a
> complete file in the archives, and your new GUC forces replay of the
> former.  This might be difficult to do in a TAP test.  Ultimately, I just
> want to validate the assumptions discussed above.

I think testing the scenario [1] is achievable. I could write a TAP
test for it - https://github.com/BRupireddy/postgres/tree/prefer_archived_wal_v1.
It's a bit flaky and needs a little more work to stabilize it:
(1) writing a custom restore_command script that sleeps only after
fetching an existing WAL file from the archive, not for a history file
or a non-existent WAL file; and (2) finding a command-line way to
sleep on Windows. But it seems doable. I can spend some more time on
it if the test is thought worth adding to core, perhaps discussing it
separately from this thread.

[1] RestoreArchivedFile():
    /*
     * When doing archive recovery, we always prefer an archived log file even
     * if a file of the same name exists in XLOGDIR.  The reason is that the
     * file in XLOGDIR could be an old, un-filled or partly-filled version
     * that was copied and restored as part of backing up $PGDATA.
     *

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Wed, Nov 16, 2022 at 11:39 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I'm attaching the v10 patch for further review.

Needed a rebase. I'm attaching the v11 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Fri, Feb 24, 2023 at 10:26 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 11:39 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > I'm attaching the v10 patch for further review.
>
> Needed a rebase. I'm attaching the v11 patch for further review.

Needed a rebase, so attaching the v12 patch. I word-smithed comments
and docs a bit.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Tue, Apr 25, 2023 at 9:27 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> > Needed a rebase. I'm attaching the v11 patch for further review.
>
> Needed a rebase, so attaching the v12 patch. I word-smithed comments
> and docs a bit.

Needed a rebase. I'm attaching the v13 patch for further consideration.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Fri, Jul 21, 2023 at 12:38 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Needed a rebase. I'm attaching the v13 patch for further consideration.

Needed a rebase. I'm attaching the v14 patch. It also has the following changes:

- Ran pgindent on the new source code.
- Ran pgperltidy on the new TAP test.
- Improved the newly added TAP test a bit. Used the new wait_for_log
core TAP function in place of custom find_in_log.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Sat, Oct 21, 2023 at 11:59 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Jul 21, 2023 at 12:38 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Needed a rebase. I'm attaching the v13 patch for further consideration.
>
> Needed a rebase. I'm attaching the v14 patch. It also has the following changes:
>
> - Ran pgindent on the new source code.
> - Ran pgperltidy on the new TAP test.
> - Improved the newly added TAP test a bit. Used the new wait_for_log
> core TAP function in place of custom find_in_log.
>
> Thoughts?

I took a closer look at v14 and came up with the following changes:

1. Used advance_wal introduced by commit c161ab74f7.
2. Simplified the core logic and new TAP tests.
3. Reworded the comments and docs.
4. Simplified new DEBUG messages.

I've attached the v15 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Thu, Dec 28, 2023 at 5:26 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I took a closer look at v14 and came up with the following changes:
>
> 1. Used advance_wal introduced by commit c161ab74f7.
> 2. Simplified the core logic and new TAP tests.
> 3. Reworded the comments and docs.
> 4. Simplified new DEBUG messages.
>
> I've attached the v15 patch for further review.

Per a recent commit c538592, FATAL-ized perl warnings in the newly
added TAP test and attached the v16 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Wed, Jan 3, 2024 at 4:58 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Dec 28, 2023 at 5:26 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > I took a closer look at v14 and came up with the following changes:
> >
> > 1. Used advance_wal introduced by commit c161ab74f7.
> > 2. Simplified the core logic and new TAP tests.
> > 3. Reworded the comments and docs.
> > 4. Simplified new DEBUG messages.
> >
> > I've attached the v15 patch for further review.
>
> Per a recent commit c538592, FATAL-ized perl warnings in the newly
> added TAP test and attached the v16 patch.

Needed a rebase due to commit 776621a (conflict in
src/test/recovery/meson.build for new TAP test file added). Please
find the attached v17 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Wed, Jan 31, 2024 at 6:30 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Needed a rebase due to commit 776621a (conflict in
> src/test/recovery/meson.build for new TAP test file added). Please
> find the attached v17 patch.

Strengthened tests a bit by using recovery_min_apply_delay to mimic
standby spending some time fetching from archive. PSA v18 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Japin Li
Date:
On Mon, 19 Feb 2024 at 18:36, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> On Wed, Jan 31, 2024 at 6:30 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>>
>> Needed a rebase due to commit 776621a (conflict in
>> src/test/recovery/meson.build for new TAP test file added). Please
>> find the attached v17 patch.
>
> Strengthened tests a bit by using recovery_min_apply_delay to mimic
> standby spending some time fetching from archive. PSA v18 patch.

Here are some minor comments:

[1]
+        primary). However, the standby exhausts all the WAL present in pg_wal

s|pg_wal|<filename>pg_wal</filename>|g

[2]
+# Ensure checkpoint doesn't come in our way
+$primary->append_conf('postgresql.conf', qq(
+    min_wal_size = 2MB
+    max_wal_size = 1GB
+    checkpoint_timeout = 1h
+    autovacuum = off
+));

Keeping the same indentation might be better.



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Mon, Feb 19, 2024 at 8:25 PM Japin Li <japinli@hotmail.com> wrote:
>
> > Strengthened tests a bit by using recovery_min_apply_delay to mimic
> > standby spending some time fetching from archive. PSA v18 patch.
>
> Here are some minor comments:

Thanks for taking a look at it.

> [1]
> +        primary). However, the standby exhausts all the WAL present in pg_wal
>
> s|pg_wal|<filename>pg_wal</filename>|g

Done.

> [2]
> +# Ensure checkpoint doesn't come in our way
> +$primary->append_conf('postgresql.conf', qq(
> +    min_wal_size = 2MB
> +    max_wal_size = 1GB
> +    checkpoint_timeout = 1h
> +       autovacuum = off
> +));
>
> Keeping the same indentation might be better.

The autovacuum line looks mis-indented in the patch file. However, I
now ran src/tools/pgindent/perltidyrc
src/test/recovery/t/041_wal_source_switch.pl on it.

Please see the attached v19 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Japin Li
Date:
On Tue, 20 Feb 2024 at 13:40, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> On Mon, Feb 19, 2024 at 8:25 PM Japin Li <japinli@hotmail.com> wrote:
>> [2]
>> +# Ensure checkpoint doesn't come in our way
>> +$primary->append_conf('postgresql.conf', qq(
>> +    min_wal_size = 2MB
>> +    max_wal_size = 1GB
>> +    checkpoint_timeout = 1h
>> +       autovacuum = off
>> +));
>>
>> Keeping the same indentation might be better.
>
> The autovacuum line looks mis-indented in the patch file. However, I
> now ran src/tools/pgindent/perltidyrc
> src/test/recovery/t/041_wal_source_switch.pl on it.
>

Thanks for updating the patch.  It seems still with the wrong indent.

diff --git a/src/test/recovery/t/041_wal_source_switch.pl b/src/test/recovery/t/041_wal_source_switch.pl
index 082680bf4a..b5eddba1d5 100644
--- a/src/test/recovery/t/041_wal_source_switch.pl
+++ b/src/test/recovery/t/041_wal_source_switch.pl
@@ -18,9 +18,9 @@ $primary->init(
 # Ensure checkpoint doesn't come in our way
 $primary->append_conf(
     'postgresql.conf', qq(
-    min_wal_size = 2MB
-    max_wal_size = 1GB
-    checkpoint_timeout = 1h
+    min_wal_size = 2MB
+    max_wal_size = 1GB
+    checkpoint_timeout = 1h
     autovacuum = off
 ));
 $primary->start;
@@ -85,7 +85,7 @@ my $offset = -s $standby->logfile;
 my $apply_delay = $retry_interval * 5;
 $standby->append_conf(
     'postgresql.conf', qq(
-recovery_min_apply_delay = '${apply_delay}ms'
+    recovery_min_apply_delay = '${apply_delay}ms'
 ));
 $standby->start;


Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Tue, Feb 20, 2024 at 11:54 AM Japin Li <japinli@hotmail.com> wrote:
>
>
> On Tue, 20 Feb 2024 at 13:40, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> > On Mon, Feb 19, 2024 at 8:25 PM Japin Li <japinli@hotmail.com> wrote:
> >> [2]
> >> +# Ensure checkpoint doesn't come in our way
> >> +$primary->append_conf('postgresql.conf', qq(
> >> +    min_wal_size = 2MB
> >> +    max_wal_size = 1GB
> >> +    checkpoint_timeout = 1h
> >> +       autovacuum = off
> >> +));
> >>
> >> Keeping the same indentation might be better.
> >
> > The autovacuum line looks mis-indented in the patch file. However, I
> > now ran src/tools/pgindent/perltidyrc
> > src/test/recovery/t/041_wal_source_switch.pl on it.
> >
>
> Thanks for updating the patch.  It seems still with the wrong indent.

Thanks. perltidyrc didn't complain about anything on v19. However, I
kept the alignment same as other TAP tests for multi-line append_conf.
If that's not correct, I'll leave it to the committer to decide. PSA
v20 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Nathan Bossart
Date:
cfbot claims that this one needs another rebase.

I've spent some time thinking about this one.  I'll admit I'm a bit worried
about adding more complexity to this state machine, but I also haven't
thought of any other viable approaches, and this still seems like a useful
feature.  So, for now, I think we should continue with the current
approach.

+        fails to switch to stream mode, it falls back to archive mode. If this
+        parameter value is specified without units, it is taken as
+        milliseconds. Default is <literal>5min</literal>. With a lower value

Does this really need to be milliseconds?  I would think that any
reasonable setting would be at least on the order of seconds.

+        attempts. To avoid this, it is recommended to set a reasonable value.

I think we might want to suggest what a "reasonable value" is.

+    static bool canSwitchSource = false;
+    bool        switchSource = false;

IIUC "canSwitchSource" indicates that we are trying to force a switch to
streaming, but we are currently exhausting anything that's present in the
pg_wal directory, while "switchSource" indicates that we should force a
switch to streaming right now.  Furthermore, "canSwitchSource" is static
while "switchSource" is not.  Is there any way to simplify this?  For
example, would it be possible to make an enum that tracks the
streaming_replication_retry_interval state?

             /*
              * Don't allow any retry loops to occur during nonblocking
-             * readahead.  Let the caller process everything that has been
-             * decoded already first.
+             * readahead if we failed to read from the current source. Let the
+             * caller process everything that has been decoded already first.
              */
-            if (nonblocking)
+            if (nonblocking && lastSourceFailed)
                 return XLREAD_WOULDBLOCK;

Why do we skip this when "switchSource" is set?

+            /* Reset the WAL source switch state */
+            if (switchSource)
+            {
+                Assert(canSwitchSource);
+                Assert(currentSource == XLOG_FROM_STREAM);
+                Assert(oldSource == XLOG_FROM_ARCHIVE);
+                switchSource = false;
+                canSwitchSource = false;
+            }

How do we know that oldSource is guaranteed to be XLOG_FROM_ARCHIVE?  Is
there no way it could be XLOG_FROM_PG_WAL?

+#streaming_replication_retry_interval = 5min    # time after which standby
+                    # attempts to switch WAL source from archive to
+                    # streaming replication
+                    # in milliseconds; 0 disables

I think we might want to turn this feature off by default, at least for the
first release.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Tue, Mar 5, 2024 at 7:34 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> cfbot claims that this one needs another rebase.

Yeah, the conflict was with the new TAP test file name in
src/test/recovery/meson.build.

> I've spent some time thinking about this one.  I'll admit I'm a bit worried
> about adding more complexity to this state machine, but I also haven't
> thought of any other viable approaches,

Right. I understand that the WaitForWALToBecomeAvailable()'s state
machine is a complex piece.

> and this still seems like a useful
> feature.  So, for now, I think we should continue with the current
> approach.

Yes, the feature is useful like mentioned in the docs as below:

+        Reading WAL from archive may not always be as efficient and fast as
+        reading from primary. This can be due to the differences in disk types,
+        IO costs, network latencies etc. All of these can impact the recovery
+        performance on standby, and can increase the replication lag on
+        primary. In addition, the primary keeps accumulating WAL needed for the
+        standby while the standby reads WAL from archive, because the standby
+        replication slot stays inactive. To avoid these problems, one can use
+        this parameter to make standby switch to stream mode sooner.

> +        fails to switch to stream mode, it falls back to archive mode. If this
> +        parameter value is specified without units, it is taken as
> +        milliseconds. Default is <literal>5min</literal>. With a lower value
>
> Does this really need to be milliseconds?  I would think that any
> reasonable setting would at least on the order of seconds.

Agreed. Done that way.

> +        attempts. To avoid this, it is recommended to set a reasonable value.
>
> I think we might want to suggest what a "reasonable value" is.

It really depends on the WAL generation rate on the primary. If the
WAL files grow faster, the disk runs out of space sooner, so setting a
value that makes frequent WAL source switch attempts can help. It's
hard to suggest a one-size-fits-all value. Therefore, I've tweaked the
docs a bit to reflect the fact that it depends on the WAL generation rate.
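As a rough illustration of that dependence (the numbers below are
invented for the example, not taken from the thread): while the
standby reads from the archive, its replication slot stays inactive,
so the WAL retained on the primary grows roughly linearly with the
generation rate. At 16 MB of WAL per second, a 5 minute retry
interval retains roughly 5 GB of extra WAL. A minimal sketch, which
deliberately ignores checkpoints, segment recycling, and slot size
limits:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Back-of-the-envelope sizing aid: bytes of WAL the primary retains
 * for an inactive slot over a given interval, assuming a steady
 * generation rate.  Purely illustrative.
 */
static uint64_t
wal_retained_bytes(uint64_t wal_bytes_per_sec, uint64_t interval_sec)
{
	return wal_bytes_per_sec * interval_sec;
}
```

Under those assumed numbers, a workload that fills disk quickly
argues for a shorter interval; a mostly idle one tolerates a longer one.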

> +       static bool canSwitchSource = false;
> +       bool            switchSource = false;
>
> IIUC "canSwitchSource" indicates that we are trying to force a switch to
> streaming, but we are currently exhausting anything that's present in the
> pg_wal directory,

Right.

> while "switchSource" indicates that we should force a
> switch to streaming right now.

It's not indicating a forced switch; it says "previously I was asked
to switch the source via canSwitchSource; now that I've exhausted all
the WAL from the pg_wal directory, I'll make a source switch attempt now".

> Furthermore, "canSwitchSource" is static
> while "switchSource" is not.

This is because WaitForWALToBecomeAvailable() has to remember the
decision (that streaming_replication_retry_interval has elapsed)
across calls, whereas switchSource is decided within
WaitForWALToBecomeAvailable() on every function call.

> Is there any way to simplify this?  For
> example, would it be possible to make an enum that tracks the
> streaming_replication_retry_interval state?

I guess the way it is right now looks simple IMHO. If the suggestion
is to have an enum like below; it looks overkill for just two states.

typedef enum
{
    CAN_SWITCH_SOURCE,
    SWITCH_SOURCE
} XLogSourceSwitchState;

>                         /*
>                          * Don't allow any retry loops to occur during nonblocking
> -                        * readahead.  Let the caller process everything that has been
> -                        * decoded already first.
> +                        * readahead if we failed to read from the current source. Let the
> +                        * caller process everything that has been decoded already first.
>                          */
> -                       if (nonblocking)
> +                       if (nonblocking && lastSourceFailed)
>                                 return XLREAD_WOULDBLOCK;
>
> Why do we skip this when "switchSource" is set?

It was leftover from the initial version of the patch - I was then
encountering some issue and had that piece there. Removed it now.

> +                       /* Reset the WAL source switch state */
> +                       if (switchSource)
> +                       {
> +                               Assert(canSwitchSource);
> +                               Assert(currentSource == XLOG_FROM_STREAM);
> +                               Assert(oldSource == XLOG_FROM_ARCHIVE);
> +                               switchSource = false;
> +                               canSwitchSource = false;
> +                       }
>
> How do we know that oldSource is guaranteed to be XLOG_FROM_ARCHIVE?  Is
> there no way it could be XLOG_FROM_PG_WAL?

No. switchSource is set to true only when canSwitchSource is set to
true, which happens only when currentSource is XLOG_FROM_ARCHIVE (see
SwitchWALSourceToPrimary()).

> +#streaming_replication_retry_interval = 5min   # time after which standby
> +                                       # attempts to switch WAL source from archive to
> +                                       # streaming replication
> +                                       # in milliseconds; 0 disables
>
> I think we might want to turn this feature off by default, at least for the
> first release.

Agreed. Done that way.

Please see the attached v21 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Tue, Mar 05, 2024 at 11:38:37PM +0530, Bharath Rupireddy wrote:
> On Tue, Mar 5, 2024 at 7:34 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Is there any way to simplify this?  For
>> example, would it be possible to make an enum that tracks the
>> streaming_replication_retry_interval state?
> 
> I guess the way it is right now looks simple IMHO. If the suggestion
> is to have an enum like below; it looks overkill for just two states.
> 
> typedef enum
> {
>     CAN_SWITCH_SOURCE,
>     SWITCH_SOURCE
> } XLogSourceSwitchState;

I was thinking of something more like

    typedef enum
    {
        NO_FORCE_SWITCH_TO_STREAMING,        /* no switch necessary */
        FORCE_SWITCH_TO_STREAMING_PENDING,    /* exhausting pg_wal */
        FORCE_SWITCH_TO_STREAMING,            /* switch to streaming now */
    } WALSourceSwitchState;

At least, that illustrates my mental model of the process here.  IMHO
that's easier to follow than two similarly-named bool variables.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Wed, Mar 6, 2024 at 1:22 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> I was thinking of something more like
>
>         typedef enum
>         {
>                 NO_FORCE_SWITCH_TO_STREAMING,           /* no switch necessary */
>                 FORCE_SWITCH_TO_STREAMING_PENDING,      /* exhausting pg_wal */
>                 FORCE_SWITCH_TO_STREAMING,                      /* switch to streaming now */
>         } WALSourceSwitchState;
>
> At least, that illustrates my mental model of the process here.  IMHO
> that's easier to follow than two similarly-named bool variables.

I played with that idea and it came out very nice. Please see the
attached v22 patch. Note that personally I didn't like "FORCE" being
there in the names, so I've simplified them a bit.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, Mar 06, 2024 at 10:02:43AM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 6, 2024 at 1:22 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I was thinking of something more like
>>
>>         typedef enum
>>         {
>>                 NO_FORCE_SWITCH_TO_STREAMING,           /* no switch necessary */
>>                 FORCE_SWITCH_TO_STREAMING_PENDING,      /* exhausting pg_wal */
>>                 FORCE_SWITCH_TO_STREAMING,                      /* switch to streaming now */
>>         } WALSourceSwitchState;
>>
>> At least, that illustrates my mental model of the process here.  IMHO
>> that's easier to follow than two similarly-named bool variables.
> 
> I played with that idea and it came out very nice. Please see the
> attached v22 patch. Note that personally I didn't like "FORCE" being
> there in the names, so I've simplified them a bit.

Thanks.  I'd like to spend some time testing this, but from a glance, the
code appears to be in decent shape.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Wed, Mar 6, 2024 at 9:49 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> > I played with that idea and it came out very nice. Please see the
> > attached v22 patch. Note that personally I didn't like "FORCE" being
> > there in the names, so I've simplified them a bit.
>
> Thanks.  I'd like to spend some time testing this, but from a glance, the
> code appears to be in decent shape.

Rebase needed after 071e3ad59d6fd2d6d1277b2bd9579397d10ded28 due to a
conflict in meson.build. Please see the attached v23 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Sun, Mar 17, 2024 at 11:37:58AM +0530, Bharath Rupireddy wrote:
> Rebase needed after 071e3ad59d6fd2d6d1277b2bd9579397d10ded28 due to a
> conflict in meson.build. Please see the attached v23 patch.

I've been reading this patch, and this is a very tricky one.  Please
be *very* cautious.

+#streaming_replication_retry_interval = 0    # time after which standby
+                    # attempts to switch WAL source from archive to
+                    # streaming replication in seconds; 0 disables

This stuff allows a minimal retry interval of 1s.  Could it be useful
to have more responsiveness here and allow lower values than that?
Why not switching the units to be milliseconds?

+    if (streaming_replication_retry_interval <= 0 ||
+        !StandbyMode ||
+        currentSource != XLOG_FROM_ARCHIVE)
+        return SWITCH_TO_STREAMING_NONE;

Hmm.  Perhaps this should mention why we don't care about the
consistent point.

+                /* See if we can switch WAL source to streaming */
+                if (wal_source_switch_state == SWITCH_TO_STREAMING_NONE)
+                    wal_source_switch_state = SwitchWALSourceToPrimary();

Rather than a routine that returns as result the value to use for the
GUC, I'd suggest to let this routine set the GUC as there is only one
caller of SwitchWALSourceToPrimary().  This can also include a check
on SWITCH_TO_STREAMING_NONE, based on what I'm reading that.

-       if (lastSourceFailed)
+       if (lastSourceFailed ||
+           wal_source_switch_state == SWITCH_TO_STREAMING)

Hmm.  This one may be tricky.  I'd recommend a separation between the
failure in reading from a source and the switch to a new "forced"
source.

+                if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
+                    readFrom = XLOG_FROM_PG_WAL;
+                else
+                    readFrom = currentSource == XLOG_FROM_ARCHIVE ?
+                        XLOG_FROM_ANY : currentSource;

WALSourceSwitchState looks confusing here, and are you sure that this
is actually correct?  Shouldn't we still try XLOG_FROM_ANY or a read
from the archives even with a streaming switch pending?

By the way, I am not convinced that what you have is the best
interface ever.  This assumes that we'd always want to switch to
streaming more aggressively.  Could there be a point in also
controlling if we should switch to pg_wal/ or just to archiving more
aggressively as well, aka be able to do the opposite switch of WAL
source?  This design looks somewhat limited to me.  The origin of the
issue is that we don't have a way to control the order of the sources
consumed by WAL replay.  Perhaps something like a replay_source_order
that uses a list would be better, with elements settable to archive,
pg_wal and streaming?
--
Michael

Attachment

Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
On Mon, Mar 18, 2024 at 11:38 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, Mar 17, 2024 at 11:37:58AM +0530, Bharath Rupireddy wrote:
> > Rebase needed after 071e3ad59d6fd2d6d1277b2bd9579397d10ded28 due to a
> > conflict in meson.build. Please see the attached v23 patch.
>
> I've been reading this patch, and this is a very tricky one.  Please
> be *very* cautious.

Thanks for looking into this.

> +#streaming_replication_retry_interval = 0    # time after which standby
> +                    # attempts to switch WAL source from archive to
> +                    # streaming replication in seconds; 0 disables
>
> This stuff allows a minimal retry interval of 1s.  Could it be useful
> to have more responsiveness here and allow lower values than that?
> Why not switching the units to be milliseconds?

Nathan had a different view on this, preferring it on the order of seconds
- https://www.postgresql.org/message-id/20240305020452.GA3373526%40nathanxps13.
If set to too low a value, the frequency of the standby trying to connect
to the primary increases. IMO, the order of seconds seems fine.

> +    if (streaming_replication_retry_interval <= 0 ||
> +        !StandbyMode ||
> +        currentSource != XLOG_FROM_ARCHIVE)
> +        return SWITCH_TO_STREAMING_NONE;
>
> Hmm.  Perhaps this should mention why we don't care about the
> consistent point.

Are you asking why we don't care whether the standby has reached a
consistent point when switching to streaming mode due to this new
parameter? If so, the same applies when a standby typically switches to
streaming replication (getting WAL from the primary) today, that is,
when receiving from the WAL archive finishes (no more WAL left there)
or fails for any reason. The standby doesn't care about the consistent
point even today; it just trusts the WAL source and makes the switch.

> +                /* See if we can switch WAL source to streaming */
> +                if (wal_source_switch_state == SWITCH_TO_STREAMING_NONE)
> +                    wal_source_switch_state = SwitchWALSourceToPrimary();
>
> Rather than a routine that returns as result the value to use for the
> GUC, I'd suggest to let this routine set the GUC as there is only one
> caller of SwitchWALSourceToPrimary().  This can also include a check
> on SWITCH_TO_STREAMING_NONE, based on what I'm reading that.

Firstly, wal_source_switch_state is not a GUC; it's a static variable
used across WaitForWALToBecomeAvailable calls. And if you are
suggesting changing SwitchWALSourceToPrimary so that it sets
wal_source_switch_state directly, I'd rather not do that, because the
function would then get called unnecessarily when
wal_source_switch_state is not SWITCH_TO_STREAMING_NONE.

> -       if (lastSourceFailed)
> +       if (lastSourceFailed ||
> +           wal_source_switch_state == SWITCH_TO_STREAMING)
>
> Hmm.  This one may be tricky.  I'd recommend a separation between the
> failure in reading from a source and the switch to a new "forced"
> source.

Separation would just add duplicate code. Moreover, the code wrapped
within if (lastSourceFailed) doesn't do any error handling as such; it
just resets a few things from the previous source and sets the next
source.

FWIW, please check [1] (and the discussion thereon) for how the
lastSourceFailed flag is used to consume all the streamed WAL in
pg_wal directly upon detecting the promotion trigger file. Therefore, I
see no problem with the way it is right now for this new feature.

[1]
                    /*
                     * Data not here yet. Check for trigger, then wait for
                     * walreceiver to wake us up when new WAL arrives.
                     */
                    if (CheckForStandbyTrigger())
                    {
                        /*
                         * Note that we don't return XLREAD_FAIL immediately
                         * here. After being triggered, we still want to
                         * replay all the WAL that was already streamed. It's
                         * in pg_wal now, so we just treat this as a failure,
                         * and the state machine will move on to replay the
                         * streamed WAL from pg_wal, and then recheck the
                         * trigger and exit replay.
                         */
                        lastSourceFailed = true;

> +                if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
> +                    readFrom = XLOG_FROM_PG_WAL;
> +                else
> +                    readFrom = currentSource == XLOG_FROM_ARCHIVE ?
> +                        XLOG_FROM_ANY : currentSource;
>
> WALSourceSwitchState looks confusing here, and are you sure that this
> is actually correct?  Shouldn't we still try XLOG_FROM_ANY or a read
> from the archives even with a streaming switch pending?

Please see the discussion starting from
https://www.postgresql.org/message-id/20221008215221.GA894639%40nathanxps13.
We wanted to keep the existing behaviour the same when we
intentionally switch source to streaming from archive due to the
timeout. The existing behaviour is to exhaust WAL in pg_wal before
switching the source to streaming after failure to fetch from archive.

When wal_source_switch_state is SWITCH_TO_STREAMING_PENDING,
currentSource is already XLOG_FROM_ARCHIVE (to make this explicit, I've
added an assert in the attached new v24 patch). And we don't want to
pass XLOG_FROM_ANY to XLogFileReadAnyTLI and fetch from the archive
again. Hence, we choose readFrom = XLOG_FROM_PG_WAL to specifically
tell XLogFileReadAnyTLI to read from pg_wal directly.

> By the way, I am not convinced that what you have is the best
> interface ever.  This assumes that we'd always want to switch to
> streaming more aggressively.  Could there be a point in also
> controlling if we should switch to pg_wal/ or just to archiving more
> aggressively as well, aka be able to do the opposite switch of WAL
> source?  This design looks somewhat limited to me.  The origin of the
> issue is that we don't have a way to control the order of the sources
> consumed by WAL replay.  Perhaps something like a replay_source_order
> that uses a list would be better, with elements settable to archive,
> pg_wal and streaming?

The intention of this feature is to provide a way for a streaming
standby to quickly detect when the primary is up and running, without
having to wait until either all the WAL in the archive is consumed or a
fetch from the archive fails. Advantages of this feature are:
1) it can make recovery a bit faster (fetching from the archive can add
costs from different storage types, IO costs and network delays), thus
reducing the replication lag; 2) the primary (in a replication-slot-based
streaming replication setup) doesn't have to keep the required WAL for
the standby for long durations, reducing the risk of running out of
disk space.

IMHO, it makes sense to have something like replay_source_order if
there's any use case that arises in future requiring the standby to
intentionally switch to pg_wal or archive. But not as part of this
feature.

Please see the attached v24 patch. I've added an assertion that the
current source is archive before calling XLogFileReadAnyTLI if
wal_source_switch_state is SWITCH_TO_STREAMING_PENDING. I've also
added the new enum WALSourceSwitchState to typedefs.list to make
pgindent adjust it correctly.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
Hi,

I took a brief look at the patch.

From a motivation aspect, I can see this being useful for synchronous
replicas if you have synchronous_commit set to flush mode.
So +1 on the feature and easier configurability, although thinking
about it more, you could probably have the restore script be smarter
and return non-zero exit codes periodically.

The patch needs to be rebased but I tested this against an older 17 build.

> + ereport(DEBUG1,
> + errmsg_internal("switched WAL source from %s to %s after %s",
> + xlogSourceNames[oldSource],

Not sure if you're intentionally changing from DEBUG1 from DEBUG2.

> * standby and increase the replication lag on primary.

Do you mean "increase replication lag on standby"?
nit: reading from archive *could* be faster, since in theory it's not
single-processed/threaded.

> However,
> + * exhaust all the WAL present in pg_wal before switching. If successful,
> + * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
> + * back to XLOG_FROM_ARCHIVE state.

I think I'm missing how this happens. Or what "successful" means. If I'm reading
it right, no matter what happens we will always move to
XLOG_FROM_STREAM based on how
the state machine works?

I tested this in a basic RR setup without replication slots (e.g. log
shipping), where the WAL is available in the archive but the primary
has always rotated the WAL out, with
'streaming_replication_retry_interval = 1'. This leads the RR to become
stuck: it stops fetching from the archive and loops between
XLOG_FROM_PG_WAL and XLOG_FROM_STREAM.

When 'streaming_replication_retry_interval' is breached, we transition
from {currentSource, wal_source_switch_state}

{XLOG_FROM_ARCHIVE, SWITCH_TO_STREAMING_NONE} -> {XLOG_FROM_ARCHIVE,
SWITCH_TO_STREAMING_PENDING} with readFrom = XLOG_FROM_PG_WAL.

That reads the last record successfully in pg_wal and then fails to
read the next one because it doesn't exist, transitioning to

{XLOG_FROM_STREAM, SWITCH_TO_STREAMING_PENDING}.

XLOG_FROM_STREAM fails because the WAL is no longer there on primary,
it sets it back to {XLOG_FROM_ARCHIVE, SWITCH_TO_STREAMING_PENDING}.

> last_fail_time = now;
> currentSource = XLOG_FROM_ARCHIVE;
> break;

Since the state is still SWITCH_TO_STREAMING_PENDING from the previous
loops, it forces

>                   Assert(currentSource == XLOG_FROM_ARCHIVE);
>                   readFrom = XLOG_FROM_PG_WAL;
>                   ...
>                   readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2, readFrom);

And this readFile call seems to always succeed, since it can read the
latest WAL record but not the next one, which is in the archive,
leading to a transition back to XLOG_FROM_STREAM, and the cycle repeats.


>  /*
>  * Nope, not found in archive or pg_wal.
> */
> lastSourceFailed = true;

I don't think this gets triggered for the XLOG_FROM_PG_WAL case, which
means the safety check you added doesn't actually kick in.

>               if (wal_source_switch_state == SWITCH_TO_STREAMING_PENDING)
>               {
>                     wal_source_switch_state = SWITCH_TO_STREAMING;
>                     elog(LOG, "SWITCH_TO_STREAMING_PENDING TO SWITCH_TO_STREAMING");
>                }

Thanks
-- 
John Hsu - Amazon Web Services



Re: Switching XLog source from archive to streaming when primary available

From
Bharath Rupireddy
Date:
Hi,

Thanks for looking into this.

On Fri, Aug 23, 2024 at 5:03 AM John H <johnhyvr@gmail.com> wrote:
>
> From a motivation aspect, I can see this being useful for synchronous
> replicas if you have synchronous_commit set to flush mode.

In synchronous replication setup, until standby finishes fetching WAL
from the archive, the commits on the primary have to wait which can
increase the query latency. If the standby can connect to the primary
as soon as the broken connection is restored, it can fetch the WAL
soon and transaction commits can continue on the primary. Is my
understanding correct? Is there anything more to this?

I talked to Michael Paquier at PGConf.Dev 2024 and heard some concerns
about this feature around dealing with changing timelines. I can't
recall them right now.

And, there were some cautions raised upthread -
https://www.postgresql.org/message-id/20240305020452.GA3373526%40nathanxps13
and https://www.postgresql.org/message-id/ZffaQt7UbM2Q9kYh%40paquier.xyz.

> So +1 on feature, easier configurability, although thinking about it
> more you could probably have the restore script be smarter and provide
> non-zero exit codes periodically.

Interesting. Yes, the restore script has to be smarter to detect
broken connections and distinguish whether the server is performing
just archive recovery/PITR or is a streaming standby. Not doing that
right can, perhaps, cause data loss (?).

> The patch needs to be rebased but I tested this against an older 17 build.

Will rebase soon.

> > + ereport(DEBUG1,
> > + errmsg_internal("switched WAL source from %s to %s after %s",
> > + xlogSourceNames[oldSource],
>
> Not sure if you're intentionally changing from DEBUG1 from DEBUG2.

Will change.

> > * standby and increase the replication lag on primary.
>
> Do you mean "increase replication lag on standby"?
> nit: reading from archive *could* be faster, since in theory it's not
> single-processed/threaded.

Yes. I think we can just say "All of these can impact the recovery
performance on standby and increase the replication lag."

> > However,
> > + * exhaust all the WAL present in pg_wal before switching. If successful,
> > + * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
> > + * back to XLOG_FROM_ARCHIVE state.
>
> I think I'm missing how this happens. Or what "successful" means. If I'm reading
> it right, no matter what happens we will always move to
> XLOG_FROM_STREAM based on how
> the state machine works?

Please have a look at some discussion upthread on exhausting pg_wal
before switching -
https://www.postgresql.org/message-id/20230119005014.GA3838170%40nathanxps13.
Even today, the standby exhausts pg_wal before switching to streaming
from the archive.

> I tested this in a basic RR setup without replication slots (e.g. log
> shipping) where the
> WAL is available in the archive but the primary always has the WAL
> rotated out and
> 'streaming_replication_retry_interval = 1'. This leads the RR to
> become stuck where it stops fetching from
> archive and loops between XLOG_FROM_PG_WAL and XLOG_FROM_STREAM.

Nice catch. This is a problem. One idea is to disable the
streaming_replication_retry_interval feature for slot-less streaming
replication: either disallow setting the GUC in the assign_hook when
primary_slot_name isn't specified, or check for it when deciding to
switch the WAL source. Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Hi,

On Thu, Aug 29, 2024 at 6:32 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> In synchronous replication setup, until standby finishes fetching WAL
> from the archive, the commits on the primary have to wait which can
> increase the query latency. If the standby can connect to the primary
> as soon as the broken connection is restored, it can fetch the WAL
> soon and transaction commits can continue on the primary. Is my
> understanding correct? Is there anything more to this?
>

Yup, if you're running with synchronous_commit = 'on' with synchronous
replicas, then the replica can continue streaming changes into pg_wal
faster than WAL replay, so commits may be unblocked faster.

> I talked to Michael Paquier at PGConf.Dev 2024 and got some concerns
> about this feature for dealing with changing timelines. I can't think
> of them right now.

I'm not sure what the risk would be if the WAL/history files we sync
via streaming are the same as those we replay from the archive.


> And, there were some cautions raised upthread -
> https://www.postgresql.org/message-id/20240305020452.GA3373526%40nathanxps13
> and https://www.postgresql.org/message-id/ZffaQt7UbM2Q9kYh%40paquier.xyz.

Yup agreed. I need to understand this area a lot better before I can
do a more in-depth review.

> Interesting. Yes, the restore script has to be smarter to detect
> broken connections and distinguish whether the server is performing
> just archive recovery/PITR or is a streaming standby. Not doing that
> right can, perhaps, cause data loss (?).

I don't think there would be data loss; only replay gets stuck/slows
down. It wouldn't be any different today if the restore script returned
a non-zero exit status. The end user could configure their restore
script to return a non-zero status, based on some condition, to move to
streaming.

> > > However,
> > > + * exhaust all the WAL present in pg_wal before switching. If successful,
> > > + * the state machine moves to XLOG_FROM_STREAM state, otherwise it falls
> > > + * back to XLOG_FROM_ARCHIVE state.
> >
> > I think I'm missing how this happens. Or what "successful" means. If I'm reading
> > it right, no matter what happens we will always move to
> > XLOG_FROM_STREAM based on how
> > the state machine works?
>
> Please have a look at some discussion upthread on exhausting pg_wal
> before switching -
> https://www.postgresql.org/message-id/20230119005014.GA3838170%40nathanxps13.
> Even today, the standby exhausts pg_wal before switching to streaming
> from the archive.
>

I'm getting caught on the word "successful". My rough understanding of
WaitForWALToBecomeAvailable is that once you're in XLOG_FROM_PG_WAL, if it was
unsuccessful for whatever reason, it will still transition to
XLOG_FROM_STREAMING.
It does not loop back to XLOG_FROM_ARCHIVE if XLOG_FROM_PG_WAL fails.

> Nice catch. This is a problem. One idea is to disable
> streaming_replication_retry_interval feature for slot-less streaming
> replication - either when primary_slot_name isn't specified disallow
> the GUC to be set in assign_hook or when deciding to switch the wal
> source. Thoughts?

I don't think it's dependent on slot-less streaming. You would also run into the
issue if the WAL is no longer there on the primary, which can occur
with 'max_slot_wal_keep_size'
as well.
IMO the guarantee we need to make is that when we transition from
XLOG_FROM_STREAM to XLOG_FROM_ARCHIVE for a "fresh start", we should
attempt to restore from the archive at least once. I think this means
that wal_source_switch_state should be reset back to
SWITCH_TO_STREAMING_NONE whenever we transition to XLOG_FROM_ARCHIVE.
We've attempted the switch to streaming once, so let's not continually
re-try if it failed.

Thanks,

--
John Hsu - Amazon Web Services