On Mon, Apr 29, 2024, at 6:56 AM, Amit Kapila wrote:
>
> On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
>
> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> waited long enough?
>
>
> It was an attempt to decoupled a connection failure (that keeps streaming the
> WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the primary
> is gone during the standby recovery process, there is a way to bail out.
>
I think we don't need to check primary if the WAL corresponding to
consistent_lsn is already present on the standby. Shouldn't we first
check that? Once we ensure that the required WAL is copied, just
checking server_is_in_recovery() should be sufficient. I feel that
will be a direct way of ensuring what is required rather than
indirectly verifying the same (by checking pg_stat_wal_receiver) as we
are doing currently.
How would you check it? WAL file? During recovery, you are not allowed to use
pg_current_wal_lsn.
Tomas suggested to me off-list that we should adopt a simple solution in
wait_for_end_recovery: wait for recovery_timeout without additional checks
(which means remove the pg_stat_wal_receiver logic). When we have additional
information that we can reliably use in this function, we can add it. Hence, it
is also easy to adjust the PG_TEST_TIMEOUT_DEFAULT to have stable tests.