Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached" - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"
Msg-id b334d61396e6b0657a63dc38e16d429703fe9b96.camel@j-davis.com
In response to Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: add retry mechanism for achieving recovery target before emitting FATA error "recovery ended before configured recovery target was reached"  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> If the suggestion is to have the wait and retry logic embedded into
> the user-written restore_command, IMHO, it's not a good idea as the
> restore_command is external to the core PG and the FATAL error
> "recovery ended before configured recovery target was reached" is an
> internal thing. 

It seems likely that you'd want to tweak the exact behavior for the
given system. For instance, if the files are making some progress, and
you can estimate that in 2 more minutes everything will be fine, then
you may be more willing to wait those two minutes. But if no progress
has happened since recovery began 15 minutes ago, you may want to fail
immediately.

All of this nuance would be better captured in a specialized script
than a generic timeout in the server code.
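As a rough illustration of what such a script might look like, here is a
minimal sketch of a restore_command wrapper. It keeps waiting for a
missing WAL file only while the archive is still receiving new files.
Everything specific in it is an assumption made for illustration: the
archive is a plain directory, files appear in it atomically, and the
names ARCHIVE_DIR, STALL_LIMIT, MAX_WAIT and POLL_INTERVAL are
hypothetical, not PostgreSQL settings.

```python
#!/usr/bin/env python3
"""Hypothetical restore_command wrapper: wait only while the archive progresses."""
import os
import shutil
import sys
import time

ARCHIVE_DIR = "/mnt/wal_archive"  # assumption: archive is a local directory
STALL_LIMIT = 120    # assumption: give up after 2 minutes without a new file
MAX_WAIT = 900       # assumption: never wait longer than 15 minutes in total
POLL_INTERVAL = 5    # seconds between checks


def newest_mtime(directory):
    """Return the latest modification time of any file in directory, or None."""
    mtimes = [e.stat().st_mtime for e in os.scandir(directory) if e.is_file()]
    return max(mtimes, default=None)


def should_keep_waiting(now, started, last_progress, stall_limit, max_wait):
    """Wait only if the archive changed recently and the overall cap is not hit."""
    if last_progress is None or now - started > max_wait:
        return False
    return now - last_progress <= stall_limit


def restore(fname, dest, archive_dir=ARCHIVE_DIR, stall_limit=STALL_LIMIT,
            max_wait=MAX_WAIT, poll_interval=POLL_INTERVAL,
            clock=time.time, sleep=time.sleep):
    """Copy fname from the archive to dest; return 0 on success, 1 on give-up."""
    src = os.path.join(archive_dir, fname)
    started = clock()
    while True:
        if os.path.exists(src):
            shutil.copy(src, dest)
            return 0
        if not should_keep_waiting(clock(), started, newest_mtime(archive_dir),
                                   stall_limit, max_wait):
            return 1  # nonzero exit: the requested file is not available
        sleep(poll_interval)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed configuration: restore_command = '/path/to/this_script.py %f %p'
    sys.exit(restore(sys.argv[1], sys.argv[2]))
```

The point of the sketch is that the policy (how long a stall is
tolerable, what counts as progress) lives in should_keep_waiting(),
where it can be adjusted per system without touching the server.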

What do you want to do after the timeout happens? If you want to issue
a WARNING instead of failing outright, perhaps that makes sense for
exploratory PITR cases. That could be a simple boolean GUC without
needing to introduce the timeout logic into the server.

I think it's an interesting point that it can be hard to choose a
reasonable recovery target if the system is completely down. We could
use some better tooling or metadata around the LSNs, XIDs, or timestamp
ranges available in a pg_wal directory or an archive. Even better would
be to see the available named restore points. This would make it easier
to calculate how long recovery might take for a given restore point, or
whether it's not going to work at all because there's not enough WAL.
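
To make the LSN part of that idea concrete, here is a small sketch that
derives the LSN ranges covered by a set of WAL segment files from their
names alone. It assumes the default 16 MB segment size and the standard
24-hex-digit segment naming (timeline, then the two halves of the segment
number); the helper names are hypothetical. It says nothing about XIDs,
timestamps, or named restore points, which would require reading the WAL
records themselves (for example with pg_waldump), and the reported end of
the last segment is only an upper bound since that file may be partially
filled.

```python
import re

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments
SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_range(name, seg_size=WAL_SEGMENT_SIZE):
    """Return (timeline, start_lsn, end_lsn) for one WAL segment file name."""
    tli = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    segs_per_log = 0x100000000 // seg_size
    start = (log * segs_per_log + seg) * seg_size
    return tli, start, start + seg_size


def format_lsn(lsn):
    """Format a 64-bit LSN in the usual high/low hexadecimal notation."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


def available_ranges(names, seg_size=WAL_SEGMENT_SIZE):
    """Merge contiguous segments on the same timeline into LSN ranges."""
    segs = sorted(segment_range(n, seg_size)
                  for n in names if SEGMENT_NAME.match(n))
    merged = []
    for tli, start, end in segs:
        if merged and merged[-1][0] == tli and merged[-1][2] == start:
            merged[-1] = (tli, merged[-1][1], end)
        else:
            merged.append((tli, start, end))
    return merged


def describe(names):
    """Render each available range as 'timeline N: START .. END'."""
    return [f"timeline {tli}: {format_lsn(s)} .. {format_lsn(e)}"
            for tli, s, e in available_ranges(names)]
```

Calling describe(os.listdir(...)) on a pg_wal directory or an archive
directory would show at a glance whether there is a gap before the
recovery target LSN; non-segment files such as .history or .backup files
are ignored by the name filter.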

Regards,
    Jeff Davis




