On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> If the suggestion is to have the wait and retry logic embedded into
> the user-written restore_command, IMHO, it's not a good idea as the
> restore_command is external to the core PG and the FATAL error
> "recovery ended before configured recovery target was reached" is an
> internal thing.

It seems likely that you'd want to tweak the exact behavior for the
given system. For instance, if the files are making some progress, and
you can estimate that in 2 more minutes everything will be fine, then
you may be more willing to wait those two minutes. But if no progress
has happened since recovery began 15 minutes ago, you may want to fail
immediately.

All of this nuance would be better captured in a specialized script
than a generic timeout in the server code.
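
To make that concrete, here is a minimal sketch of such a script in
Python. It is only an illustration under assumptions, not something
taken from the server or an existing tool: the names wait_for_segment
and main, the 15-minute stall limit, and the use of the archive
directory's entry count as a "progress" signal are all made up for the
example. It also assumes files appear in the archive atomically (for
example via rename), and it does not special-case requests for files
that may legitimately never exist, such as timeline history files.

```python
import os
import shutil
import time


def wait_for_segment(src, progress_marker, stall_limit=900.0, poll=5.0,
                     clock=time.monotonic, sleep=time.sleep,
                     exists=os.path.exists):
    """Wait until src exists.

    Returns True once the file is there. Returns False if
    progress_marker() has returned the same value for stall_limit
    seconds, i.e. nothing seems to be arriving any more.
    """
    last_marker = progress_marker()
    last_change = clock()
    while not exists(src):
        marker = progress_marker()
        now = clock()
        if marker != last_marker:
            # Something changed: keep waiting and restart the stall timer.
            last_marker, last_change = marker, now
        elif now - last_change >= stall_limit:
            # No progress for too long: give up.
            return False
        sleep(poll)
    return True


def main(argv):
    """argv[1] is the archived file (built from %f), argv[2] is %p."""
    src, dst = argv[1], argv[2]
    archive_dir = os.path.dirname(src) or "."

    # Hypothetical progress signal: number of entries in the archive.
    def marker():
        return len(os.listdir(archive_dir))

    if wait_for_segment(src, marker):
        shutil.copy(src, dst)
        return 0
    # Non-zero exit status: the requested file is not available.
    return 1
```

With a final line such as sys.exit(main(sys.argv)) (plus import sys),
this could be wired up along the lines of
restore_command = 'python3 /path/to/restore_wait.py /mnt/archive/%f "%p"',
where the script name and archive path are placeholders. The policy
(how long to wait, what counts as progress) then lives entirely in the
script and can be adjusted per system.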

What do you want to do after the timeout happens? If you want to issue
a WARNING instead of failing outright, perhaps that makes sense for
exploratory PITR cases. That could be a simple boolean GUC without
needing to introduce the timeout logic into the server.

I think it's an interesting point that it can be hard to choose a
reasonable recovery target if the system is completely down. We could
use some better tooling or metadata around the LSNs, XIDs or timestamp
ranges available in a pg_wal directory or an archive. Even better would
be to see the available named restore points. This would make it easier
to calculate how long recovery might take for a given restore point, or
whether it's not going to work at all because there's not enough WAL.
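
As a rough sketch of the LSN part of that idea, the snippet below
derives the LSN ranges covered by a set of WAL segment file names. It
relies only on the standard 24-hex-digit segment naming (timeline, then
the segment number split in two parts) and assumes the default 16MB
segment size; the function names are hypothetical. It ignores
.partial, .history, backup label and compressed files, and the end of
each range is the end of the last segment, not necessarily the end of
valid WAL inside it. XIDs, timestamps and named restore points would
require reading the WAL records themselves (for example with
pg_waldump), which this does not attempt.

```python
import re

WAL_NAME = re.compile(r"[0-9A-F]{24}")


def segment_lsn_range(name, seg_size=16 * 1024 * 1024):
    """Return (timeline, start_lsn, end_lsn) for a WAL segment file name."""
    tli = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    segs_per_log = 0x100000000 // seg_size
    start = (log * segs_per_log + seg) * seg_size
    return tli, start, start + seg_size


def format_lsn(lsn):
    """Format an integer LSN the way the server prints it, e.g. 1/FF000000."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def contiguous_ranges(names, seg_size=16 * 1024 * 1024):
    """Merge segment names into contiguous (timeline, start, end) runs."""
    spans = sorted(segment_lsn_range(n, seg_size)
                   for n in names if WAL_NAME.fullmatch(n))
    runs = []
    for tli, start, end in spans:
        if runs and runs[-1][0] == tli and runs[-1][2] == start:
            # Segment directly follows the previous one: extend the run.
            runs[-1] = (tli, runs[-1][1], end)
        else:
            # Different timeline or a gap: start a new run.
            runs.append((tli, start, end))
    return runs
```

Feeding it os.listdir() of an archive directory and printing each run
with format_lsn would show at a glance where the gaps are, and so
whether a given target LSN can be reached at all.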

Regards,
Jeff Davis