Hi,

The FATAL error "recovery ended before configured recovery target was
reached", introduced in PG 13 by the commit at [1], brings the standby
down, possibly after it has already spent a long time in recovery. The
WAL required to reach the recovery target can take a while to arrive
at the standby from the archive location, and having the standby fail
with the FATAL error in the meantime isn't good.

Instead, how about we make the standby wait for a configurable amount
of time (a new GUC) so that it can keep looking for the required WAL?
If the WAL arrives within the wait time, the standby reaches the
recovery target and no FATAL error is raised. If it doesn't, the
timeout expires and the standby fails with the FATAL error, as it does
today. The new GUC could be set to roughly the average time it takes
for WAL to travel from the primary to the archive location plus the
time from the archive location to the standby. The default would be 0,
i.e. the wait is disabled.

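To make the proposed behaviour concrete, here is a rough, self-contained
sketch of the wait loop. It is not the attached patch and not actual
PostgreSQL code: wait_for_target_wal(), the TargetWaitResult enum, the
two callback types and the demo callback fetch_after_n_attempts() are
all names made up for illustration, and the polling interval is an
assumption of the sketch rather than part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once the WAL needed to reach the
 * recovery target could be fetched (e.g. restored from the archive). */
typedef bool (*wal_fetch_fn) (void *state);

/* Hypothetical callback: sleeps for the given number of milliseconds.
 * May be NULL, in which case the loop only accounts for the time. */
typedef void (*sleep_fn) (int ms);

typedef enum
{
    TARGET_WAL_FOUND,       /* required WAL arrived, recovery can go on */
    TARGET_WAL_TIMED_OUT    /* caller would raise the existing FATAL */
} TargetWaitResult;

/*
 * Keep retrying the fetch until it succeeds or timeout_ms has elapsed.
 * timeout_ms == 0 means "disabled": one attempt, then give up at once,
 * which matches the current fail-immediately behaviour.
 */
static TargetWaitResult
wait_for_target_wal(int timeout_ms, int poll_ms,
                    wal_fetch_fn try_fetch, void *state, sleep_fn do_sleep)
{
    int         waited_ms = 0;

    assert(timeout_ms >= 0 && poll_ms > 0 && try_fetch != NULL);

    for (;;)
    {
        int         nap;

        if (try_fetch(state))
            return TARGET_WAL_FOUND;

        if (waited_ms >= timeout_ms)
            return TARGET_WAL_TIMED_OUT;

        /* Never sleep past the deadline. */
        nap = timeout_ms - waited_ms;
        if (nap > poll_ms)
            nap = poll_ms;

        if (do_sleep != NULL)
            do_sleep(nap);
        waited_ms += nap;
    }
}

/* Demo stand-in for an archive lookup: fails *remaining more times,
 * then succeeds. */
static bool
fetch_after_n_attempts(void *state)
{
    int        *remaining = state;

    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

For example, with a 10000 ms timeout, a 1000 ms poll interval and WAL
that shows up on the fourth attempt, the function returns
TARGET_WAL_FOUND; with a 2000 ms timeout the same scenario returns
TARGET_WAL_TIMED_OUT, which is where the standby would fail with the
FATAL error.
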
I'm attaching a WIP patch. I've tested it on my dev system and the
recovery regression tests are passing with it. I will provide a better
version later, probably with a test case.

Thoughts?

[1] commit dc788668bb269b10a108e87d14fefd1b9301b793
Author: Peter Eisentraut <peter@eisentraut.org>
Date:   Wed Jan 29 15:43:32 2020 +0100

    Fail if recovery target is not reached

    Before, if a recovery target is configured, but the archive ended
    before the target was reached, recovery would end and the server would
    promote without further notice. That was deemed to be pretty wrong.
    With this change, if the recovery target is not reached, it is a fatal
    error.

    Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no>
    Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
    Discussion: https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5@lako.no

Regards,
Bharath Rupireddy.