Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached" - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"
Date
Msg-id CALj2ACXbkQE=s+mccU=4Rcg3vgTQ4QfDNsnWN=wgMHodC-FNfQ@mail.gmail.com
In response to Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Fri, Oct 22, 2021 at 5:54 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2021-10-20 at 21:35 +0530, Bharath Rupireddy wrote:
> > The FATAL error "recovery ended before configured recovery target
> > was reached" introduced by commit at [1] in PG 14 is causing the
> > standby to go down after having spent a good amount of time in
> > recovery. There can be cases where the arrival of required WAL (for
> > reaching recovery target) from the archive location to the standby
> > may take time and meanwhile the standby failing with the FATAL error
> > isn't good. Instead, how about we make the standby wait for a certain
> > amount of time (with a GUC) so that it can keep looking for the
> > required WAL.
>
> How is archiving configured, and would it be possible to introduce
> logic into the restore_command to handle slow-to-arrive WAL?

Thanks Jeff!

If the suggestion is to embed the wait-and-retry logic in the
user-written restore_command, IMHO that's not a good idea: the
restore_command is external to core PG, whereas the FATAL error
"recovery ended before configured recovery target was reached" is
raised internally. Having the retry logic (controlled by a GUC) in
core, at the point where the startup process hits the end of recovery
before reaching the target, is a better approach, and it is something
core PG can offer. With this, the work the standby has already done in
recovery isn't wasted, provided the GUC is set to a suitable value. A
reasonable value is the average time it takes for WAL to travel from
the primary to the archive location plus the time from the archive
location to the standby. By default, the new GUC can be disabled
(value 0), so that only those who want the behavior need to set it.
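
To make the idea a bit more concrete, here is a rough standalone sketch
of the retry loop, not a patch against the startup process. All names in
it (retry_until_target, wal_probe_fn, DemoArchive, and the timeout and
interval parameters standing in for the proposed GUC) are hypothetical
and exist only for illustration; the probe callback stands in for
"try the archive again and see whether the recovery target is now
reachable".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once the recovery target is reached. */
typedef bool (*wal_probe_fn) (void *state);

/* Hypothetical sleep hook, so the loop can be exercised without real waiting. */
typedef void (*sleep_fn) (int seconds);

/*
 * Keep probing for the required WAL until the target is reached or
 * timeout_secs has elapsed.  timeout_secs plays the role of the proposed
 * GUC: 0 means "probe once and give up", i.e. the current behavior.
 * Returns false when the caller should go on to raise the FATAL error.
 */
static bool
retry_until_target(int timeout_secs, int interval_secs,
                   wal_probe_fn probe, void *state, sleep_fn do_sleep)
{
    int waited = 0;

    assert(probe != NULL);

    /* Guard against a zero or negative interval, which would never advance. */
    if (interval_secs <= 0)
        interval_secs = 1;

    /* The first attempt always happens, even with retries disabled. */
    if (probe(state))
        return true;

    while (waited < timeout_secs)
    {
        int nap = interval_secs;

        /* Do not sleep past the configured timeout. */
        if (nap > timeout_secs - waited)
            nap = timeout_secs - waited;

        if (do_sleep != NULL)
            do_sleep(nap);
        waited += nap;

        if (probe(state))
            return true;
    }

    return false;
}

/* Demo probe: pretends the needed WAL shows up on the Nth attempt. */
typedef struct DemoArchive
{
    int attempts;
    int succeed_on;
} DemoArchive;

static bool
demo_probe(void *state)
{
    DemoArchive *archive = (DemoArchive *) state;

    archive->attempts++;
    return archive->attempts >= archive->succeed_on;
}
```

With the timeout at 0 the function probes exactly once, so existing
setups would see no change. With a timeout of roughly the
primary-to-archive delay plus the archive-to-standby delay, the standby
would keep looking for about as long as the WAL normally takes to show
up before giving up with the FATAL error.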

Regards,
Bharath Rupireddy.


